REPORT DOCUMENTATION PAGE

Form Approved
OMB No. 0704-0188




1. AGENCY USE ONLY (Leave blank)  2. REPORT DATE  3. REPORT TYPE AND DATES COVERED

5/9/97  FINAL TECHNICAL REPORT (3/1/94 - 2/28/97)


4. TITLE AND SUBTITLE  5. FUNDING NUMBERS

TITLE;  Dynamic  Networks  Techniques  for  Autonomous  Planning 
and  Control 

SUBTITLE: Probabilistic Counterfactuals    G - F49620-94-1-0173

6. AUTHOR(S)


Professor  Judea  Pearl 


7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)

UCLA 

Computer  Science  Department 

4532  Boelter  Hall 

Los  Angeles,  CA  90095-1596 


9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)

U.S.  Air  Force  /  Office  of  Scientific  Research 
110  Duncan  Avenue,  Suite  B115 
Bolling  Air  Force  Base 

Washington, DC 20332-0001


8. PERFORMING ORGANIZATION REPORT NUMBER

932213-00-A03 

442510-22540 


10. SPONSORING/MONITORING AGENCY REPORT NUMBER


12a. DISTRIBUTION/AVAILABILITY STATEMENT  12b. DISTRIBUTION CODE

Approved for public release; distribution is unlimited.


13. ABSTRACT (Maximum 200 words)

We have reformulated Bayesian networks as carriers of causal information. The result
is a more natural understanding of what the networks stand for, what judgments are
required in constructing the network and, most importantly, how actions and plans are to be
handled within the framework of standard probability theory. Starting with a functional
description of physical mechanisms, we were able to derive the standard probabilistic
properties of Bayesian networks and to show:

*  how  the  effects  of  unanticipated  actions  can  be  predicted  from  the  network  topology, 

*  how  qualitative  causal  judgments  can  be  integrated  with  statistical  data, 

*  how  actions  interact  with  observations,  and 

* how counterfactual sentences can be formulated and evaluated.




14. SUBJECT TERMS  15. NUMBER OF PAGES

Keywords: Causation, counterfactuals, Bayesian networks


17. SECURITY CLASSIFICATION OF REPORT: unclassified
18. SECURITY CLASSIFICATION OF THIS PAGE: unclassified
19. SECURITY CLASSIFICATION OF ABSTRACT: unclassified
20. LIMITATION OF ABSTRACT

NSN 7540-01-280-5500
Standard Form 298 (Rev. 2-89)


DYNAMIC  NETWORKS  TECHNIQUES  FOR 
AUTONOMOUS  PLANNING  AND  CONTROL 


Professor  Judea  Pearl 
Principal  Investigator 
Cognitive  Systems  Laboratory 
Computer  Science  Department 
University  of  California,  Los  Angeles,  CA  90024-1596,  USA 
judea@cs.ucla.edu 


FINAL  TECHNICAL  REPORT:  3/1/94-5/31/97 
AWARD  NO:  F49620-94-1-0173 




Objectives: 

•  Real-time  planning  under  uncertainty  using  qualitative  approximations  of  probabilities  and 
utilities. 

•  Learning  network  structures  from  empirical  data,  including  the  identification  of  hidden 
causes  and  stable  mechanisms,  by  local  analysis. 


Status  of  effort: 

To facilitate the construction of practical decision-making systems, we have focused our research
effort on basing Bayesian networks directly on causal relationships. The result is a more
natural understanding of what the networks stand for, what judgments are required in constructing
the network and, most importantly, how actions and plans are to be handled within the
framework of standard probability theory. Starting with a functional description of physical mechanisms,
we were able to derive the standard probabilistic properties of Bayesian networks and to show,
additionally:

•  how  the  effects  of  unanticipated  actions  can  be  predicted  from  the  network  topology, 

•  how  qualitative  causal  judgments  can  be  integrated  with  statistical  data, 

•  how  actions  interact  with  observations,  and 

• how counterfactual sentences can be formulated and evaluated.

Additionally, we have established an axiomatic characterization of causal dependencies,
analogous to the graphoid characterization of informational dependencies. Finally, we have
demonstrated that network-learning techniques, in the presence of hidden variables, have an enormous
scope of new applications, ranging from skill acquisition by autonomous agents to the analysis of
treatment effectiveness in clinical trials.

Our  research  has  shown  the  feasibility  of  predicting  the  merit  of  an  action  or  a  plan  by 
observing  the  performance  of  other  agents,  for  example,  watching  the  sequence  of  requests  provided 
by  a  user  of  a  computer  system  or  the  sequence  of  actions  taken  by  a  skilled  operator  of  a  complex 
system. 


Accomplishments/New Findings:

The  following  specific  results  were  obtained  during  the  period  of  performance: 

• Graphical criteria were developed for identifying conditional independence relationships
induced by systems with feedback (Pearl & Dechter 1996).

• Computer programs were developed to assist clinicians with assessing the efficacy of
treatments in experimental studies for which subject compliance is imperfect (Chickering & Pearl
1996).


• Axiomatic characterization was given for causal-relevance relationships of the form:
“Changing X will not affect Y if we hold Z constant” (Galles & Pearl 1996).

•  The  notion  of  “identification”  was  extended  to  non-parametric  systems  (Pearl  1995b)  and 
techniques  were  developed  for  non-parametric  identification  of  cause-effect  relationships  from 
nonexperimental  data  (Pearl  1995a;  Pearl  &  Robins  1995;  Balke  &  Pearl  1995;  Galles  &  Pearl 
1995). 

• Deriving algebraic expressions for identifiable causal effects (both total and direct) in
nonparametric structural models with latent variables.

• Selecting a sufficient set of measurements (covariates or confounders) that permits unbiased
estimation of causal effects in observational studies.

• Predicting (or bounding) treatment effectiveness from trials with imperfect compliance.

• Estimating (or bounding) counterfactual probabilities from statistical data (e.g., John, who
was treated and died, would have had a 90% chance of survival had he not been treated).

• A formal model has been developed, based on dynamic structural equations, which
generalizes and unifies the structural and counterfactual approaches to causal inference, explicates
their conceptual and mathematical bases and resolves their technical difficulties. A
simple rule was devised for translating a problem back and forth between the structural and
counterfactual representations, so that the one most appropriate for analysis can be chosen.

• It has been proven that the structural and counterfactual formalisms are equivalent in
recursive causal models (i.e., systems without feedback) but not when feedback is considered
possible.


Personnel  Supported: 

Principal  Investigator: 

Judea  Pearl 
Post-Docs: 

Adnan  Darwiche 
Graduate  Students: 

Alex  Balke  (PhD,  1995),  “Probabilistic  Counterfactuals:  Semantics,  Computation, 
and  Applications” 

David  Chickering  (PhD,  1996),  “Learning  Bayesian  Networks  from  Data” 

David  Galles  (PhD,  expected  June  1997),  “Causal  Theories:  A  Formalism  for  Modeling 
Action  and  Intervention” 

Huy  Cao 
Kevin  Chang 
Research  Associates: 

Rina  Dechter 
Avi  Dechter 
Norman  Dalkey 


Publications  (3/1/94-2/28/97): 

Pearl, J., “From Imaging and Stochastic Control to a Calculus of Actions,” Symposium Notes of
the 1994 AAAI Spring Symposium on Decision-Theoretic Planning, Stanford, CA, 204-209,
March 21-23, 1994.

Pearl, J., “From Adams’ conditionals to default expressions, causal conditionals, and
counterfactuals,” in E. Eells and B. Skyrms (Eds.), Probability and Conditionals, Cambridge University
Press, New York, NY, 47-74, 1994.

Pearl, J. and N. Wermuth, “When Can Association Graphs Admit A Causal Explanation?,” in
P. Cheeseman and W. Oldford (Eds.), Selecting Models and Data, Artificial Intelligence and
Statistics IV, Springer-Verlag, 205-214, 1994.

Darwiche, A. and J. Pearl, “On the Logic of Iterated Belief Revision,” in R. Fagin (Ed.),
Proceedings of the 1994 Conference on Theoretical Aspects of Reasoning about Knowledge (TARK
’94), Pacific Grove, CA, 5-23, Mar. 1994. To appear in Artificial Intelligence, Spring 1997.

Darwiche,  A.  and  J.  Pearl,  “Symbolic  Causal  Networks  for  Planning  under  Uncertainty,”  In 
Proceedings  of  the  Twelfth  National  Conference  on  Artificial  Intelligence  (AAAI-94),  Seattle, 
WA,  Volume  I,  238-244,  1994. 

Tan,  S-W.,  “Qualitative  Decision  Theory,”  In  Proceedings  of  the  Twelfth  National  Conference 
on  Artificial  Intelligence  (AAAI-94),  Seattle,  WA,  Volume  II,  928-933,  July  31  -  August  4, 
1994. 

Tan, S-W. and J. Pearl, “Specification and Evaluation of Preferences for Planning under
Uncertainty,” in J. Doyle, E. Sandewall, and P. Torasso (Eds.), Proceedings of the Fourth
International Conference, Principles of Knowledge Representation and Reasoning (KR-94),
Bonn, Germany, Morgan Kaufmann, San Francisco, CA, 530-539, May 1994.

Pearl, J., “A Probabilistic Calculus of Actions,” In R. Lopez de Mantaras and D. Poole (Eds.),
Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (UAI-94),
Morgan Kaufmann, San Mateo, CA, 454-462, 1994.

Balke,  A.,  and  Pearl,  J.,  “Counterfactual  Probabilities:  Computational  Methods,  Bounds,  and 
Applications,”  in  R.  Lopez  de  Mantaras  and  D.  Poole  (Eds.),  Proceedings  of  the  Conference 
on  Uncertainty  in  Artificial  Intelligence  (UAI-94),  Morgan  Kaufmann,  San  Mateo,  CA, 
46-54,  July  29-31,  1994. 

Tan, S-W. and Pearl, J., “Exceptional Subclasses in Qualitative Probability,” in R. Lopez de
Mantaras and D. Poole (Eds.), Proceedings of the Tenth Conference on Uncertainty in
Artificial Intelligence (UAI-94), Morgan Kaufmann, San Mateo, CA, 553-559, July 29-31, 1994.

Balke,  A.  and  Pearl,  J.,  “Probabilistic  Evaluation  of  Counterfactual  Queries,”  in  Proceedings  of 
the  Twelfth  National  Conference  on  Artificial  Intelligence  (AAAI-94),  Seattle,  WA,  Volume 
I,  230-237,  July  31  -  August  4,  1994. 


Pearl, J. and Verma, T., “A Theory of Inferred Causation,” in D. Prawitz, B. Skyrms and D.
Westerståhl (Eds.), Logic, Methodology and Philosophy of Science IX, Elsevier Science B.V.,
789-811, 1994.

Balke,  A.  and  Pearl,  J.,  “Universal  Formulas  for  Treatment  Effects  from  Noncompliance  Data,”  in 
N.P.  Jewell,  A.C.  Kimber,  M.-L.T.  Lee,  and  G.A.  Whitmore  (Eds.),  Lifetime  Data:  Models 
in  Reliability  and  Survival  Analysis,  Kluwer  Academic  Publishers,  Dordrecht,  39-43,  1995. 

Pearl,  J.,  “Causal  Inference  from  Indirect  Experiments,”  Artificial  Intelligence  in  Medicine,  Vol. 
7,  No.  6,  561-582,  1995. 

Pearl,  J.,  “On  the  Testability  of  Causal  Models  with  Latent  and  Instrumental  Variables,”  in  P. 
Besnard  and  S.  Hanks  (Eds.),  Uncertainty  in  Artificial  Intelligence  11,  Morgan  Kaufmann, 
San  Francisco,  CA,  435-443,  1995. 

Pearl,  J.  and  Robins,  J.,  “Probabilistic  evaluation  of  sequential  plans  from  causal  models  with 
hidden  variables,”  in  P.  Besnard  and  S.  Hanks  (Eds.),  Uncertainty  in  Artificial  Intelligence 
11,  Morgan  Kaufmann,  San  Francisco,  CA,  444-453,  1995. 

Pearl,  J.,  “Causation,  Action,  and  Counterfactuals,”  in  A.  Gammerman  (Ed.),  Computational 
Learning  and  Probabilistic  Reasoning,  John  Wiley  and  Sons,  New  York,  Chapter  6,  235-255, 
1995. 

Tan, S-W. and Pearl, J., “Specificity and Inheritance in Default Reasoning,” Proceedings of the
Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95), Montreal,
Quebec, August 20-25, Vol. 2, 1480-1486, 1995.

Pearl, J., “From Bayesian Networks to Causal Networks,” in A. Gammerman (Ed.), Bayesian
Networks and Probabilistic Reasoning, Alfred Waller Ltd., London, 1-31, 1994. In
Proceedings of the UNICOM Seminar on Adaptive Computing and Information Processing, Brunel
University, London, pp. 165-194, January 25-27, 1994. Also in G. Coletti, D. Dubois, and
R. Scozzafava (Eds.), Mathematical Models for Handling Partial Knowledge in Artificial
Intelligence, Plenum Publishing, New York, NY, 1995.

Pearl, J., “Causation, Action, and Counterfactuals,” Extended Abstract in A. Cohn (Ed.), 11th
European Conference on Artificial Intelligence (ECAI-94), John Wiley and Sons, Ltd.,
826-828, 1994.

Galles,  D.  and  Pearl,  J.,  “Testing  Identifiability  of  Causal  Effects,”  In  P.  Besnard  and  S.  Hanks 
(Eds.),  Uncertainty  in  Artificial  Intelligence  11,  Morgan  Kaufmann,  San  Francisco,  CA, 
185-195,  1995. 

Balke,  A.  and  Pearl,  J.,  “Counterfactuals  and  Policy  Analysis  in  Structural  Models,”  in  P. 
Besnard  and  S.  Hanks  (Eds.),  Uncertainty  in  Artificial  Intelligence  11,  Morgan  Kaufmann, 
San  Francisco,  CA,  11-18,  1995. 

Pearl, J., “Bayesian Networks,” in M. Arbib (Ed.), Handbook of Brain Theory and Neural
Networks, MIT Press, 149-153, 1995.


Pearl, J., “Causal diagrams for empirical research (with discussion),” Biometrika, 82(4),
669-710, December 1995.

Goldszmidt,  M.  and  Pearl,  J.,  “Qualitative  Probabilities  for  Default  Reasoning,  Belief  Revision 
and  Causal  Modeling,”  Artificial  Intelligence,  Vol.  84,  No.  1-2,  57-112,  1996. 

Ben-Eliyahu,  R.  &  Dechter,  R.,  “Default  Reasoning  Using  Classical  Logic,”  Artificial  Intelligence 
Journal,  Volume  84,  Issue  1-2,  113-150,  July  1996. 

Chickering,  D.M.  and  Pearl,  J.,  “A  Clinician’s  Apprentice  for  Analyzing  Non-compliance,”  in 
Proceedings  of  the  National  Conference  on  Artificial  Intelligence  (AAAI-96),  Portland,  OR, 
1269-1276,  August  1996. 

Pearl, J. and Dechter, R., “Identifying independencies in causal graphs with feedback,” in
Proceedings of Uncertainty in Artificial Intelligence (UAI-96), Portland, OR, 240-246, August
1996.

Pearl, J. and Dechter, R., “Identifying independencies in causal graphs with feedback,” In E.
Horvitz and E. F. Jensen (Eds.), Uncertainty in Artificial Intelligence, Proceedings of the
Twelfth Conference, Morgan Kaufmann: San Francisco, CA, 240-246, August 1996.

Pearl,  J.,  “Graphical  Models  for  Probabilistic  and  Causal  Reasoning,”  In  Allen  B.  Tucker,  Jr. 
(Ed.),  The  Computer  Science  and  Engineering  Handbook,  Chapter  31,  CRC  Press,  Inc., 
697-714,  1997. 

Pearl, J., “Structural and Probabilistic Causality,” in D.R. Shanks, K.J. Holyoak, and D.L. Medin
(Eds.), The Psychology of Learning and Motivation, Vol. 34: Causal Learning, Academic
Press, San Diego, CA, 393-435, 1996.

Pearl,  J.  and  Goldszmidt,  M.,  “Probabilistic  Foundations  of  Reasoning  with  Conditionals,”  in 
G.  Brewka  (Ed.),  Principles  of  Knowledge  Representation,  CSLI  Publications,  33-68,  1996. 

Pearl, J., “Causation, Action, and Counterfactuals,” in Yoav Shoham (Ed.), Theoretical
Aspects of Rationality and Knowledge, Proceedings of the Sixth Conference (TARK 1996), The
Netherlands, 51-73, March 17-20, 1996.

Pearl,  J.,  “Decision  Making  Under  Uncertainty,”  Prepared  for  CRC  Handbook  dasiptex  for  special 
50th-anniversary  issue  of  Computing  Surveys,  1996. 

Meiri, I., “Combining Qualitative and Quantitative Constraints in Temporal Reasoning,”
Artificial Intelligence, 87, 1-46, 1996.

Dechter,  R.  &  Dechter,  A.,  “Structure-Driven  Algorithms  for  Truth  Maintenance,”  Artificial 
Intelligence,  82,  1-20,  1996. 

Paz, A., Pearl, J., & Ur, S., “A New Characterization of Graphs Based on Interception,” Journal
of Graph Theory, Vol. 22, No. 2, 125-136, 1996.


Pearl,  J.,  “The  Art  and  Science  of  Cause  and  Effect,”  Given  October  29,  1996  as  part  of  the 
UCLA  81st  Faculty  Research  Lecture  Series. 

Pearl, J., “TETRAD and SEM,” UCLA Computer Science Department, Technical Report
(R-244), June 1996. Commentary on “The TETRAD Project: Constraint Based Aids to Causal
Model Specification” by R. Scheines, P. Spirtes, C. Glymour, C. Meek, and T. Richardson.
Prepared for Multivariate Behavioral Research.

Galles,  D.  &  Pearl,  J.,  “Axioms  of  Causal  Relevance,”  Preliminary  version  in  Proceedings  of 
the  Fourth  International  Conference  on  Mathematics  and  AI,  Fort  Lauderdale,  FL,  64-67, 
January,  1996.  Revision  I  submitted  to  Artificial  Intelligence,  November  1996. 

Pearl,  J.,  “Bayesian  Networks,”  UCLA  Computer  Science  Department,  Technical  Report  (R-246), 
November  1996.  To  appear  in  MIT  Encyclopedia  of  the  Cognitive  Sciences. 

Pearl, J., “Comments on R.W. Oldford’s ‘A Physical Device for Demonstrating Confounding,
Blocking, and the Role of Randomization in Uncovering a Causal Relationship’,” The
American Statistician, Vol. 50, No. 4, 387-388, November 1996.

Pearl, J., “Graphs, Structural Models and Causality,” UCLA Computer Science Department,
Technical Report (R-247), December 1996. (A condensed version of this paper has appeared
in Biometrika, 82(4), 669-710, December 1995 under the title “Causal Diagrams for
Empirical Research”.) Prepared for AAAI/MIT Press volume of Causation, Bayes Networks,
and Machine Discovery.

Balke, A. & Pearl, J., “Nonparametric Bounds on Causal Effects from Partial Compliance Data,”
UCLA Computer Science Department, Technical Report (R-199), Revision II, November
1996. To appear in JASA, September 1997.

Darwiche,  A.  &  Pearl,  J.,  “On  the  Logic  of  Iterated  Belief  Revision,”  Artificial  Intelligence,  Vol. 
89,  Nos.  1-2,  1-29,  January  1997. 

Pearl, J., “The New Challenge: From a Century of Statistics to an Age of Causation,” Proceedings
of the IASC Second World Congress, Pasadena, CA, February 1997.

Pearl, J., “On the Identification of Nonparametric Structural Models,” in M. Berkane (Ed.),
Latent Variable Modeling with Application to Causality Conference, Springer-Verlag, Lecture
Notes in Statistics, 29-68, 1997.

Pearl,  J.,  “Causation,  Action,  and  Counterfactuals,”  In  M.L.  Dalla  Chiara  et  al.  (Eds.),  Logic 
and  Scientific  Methods,  Kluwer  Academic  Publishers,  Netherlands,  355-375,  1997. 


Interactions/Transitions:

1994 4th International Conference, Principles of Knowledge Representation & Reasoning (KR-94)
1994 International Research Conference on Lifetime Data Models in Reliability & Survival Analysis


1994  Conference  on  Theoretical  Aspects  of  Reasoning  about  Knowledge  (TARK-94) 
1994  AAAI  Spring  Symposium  on  Decision-Theoretic  Planning 
1994  12th  National  Conference  on  Artificial  Intelligence  (AAAI-94) 

1994  Latent  Variable  Modelling  with  Application  to  Causality  Conference 
1994  10th  Conference  on  Uncertainty  in  Artificial  Intelligence  (UAI-94) 

1994  11th  European  Conference  on  Artificial  Intelligence  (ECAI) 

1995  11th  Conference  on  Uncertainty  in  Artificial  Intelligence  (UAI-95) 

1995  14th  International  Joint  Conference  on  Artificial  Intelligence  (IJCAI-95) 

1995  Annual  Meeting  of  the  National  Academy  of  Engineering 

1996  International  Symposium  on  AI/Math 

1996  6th  Conference  on  Theoretical  Aspects  of  Rationality  and  Knowledge  (TARK-96) 
1996  National  Conference  on  Artificial  Intelligence  (AAAI-96) 

1996  12th  Conference  on  Uncertainty  in  Artificial  Intelligence  (UAI-96) 


New  discoveries,  inventions,  or  patent  disclosures: 

None. 


Honors  /  Awards: 

Professor Pearl is a recipient of the RCA Laboratories Achievement Award (1965), and a NATO
Senior Fellowship in Science (1975). He is a Fellow of IEEE, a founding Fellow of AAAI, and a
member of the National Academy of Engineering (NAE). He was recently named UCLA’s 81st Faculty
Research Lecturer.


University  of  California 

Los  Angeles 


Structural  Causal  Models: 

A  Formalism  for  Reasoning  About  Actions  and 

Counterfactuals


A  dissertation  submitted  in  partial  satisfaction 
of  the  requirements  for  the  degree 
Doctor  of  Philosophy  in  Computer  Science 


by 

David  Jerome  Galles 


1997 


©  Copyright  by 
David  Jerome  Galles 


1997 


The  dissertation  of  David  Jerome  Galles  is  approved. 


Sander  Greenland 


Sheila  A.  Greibach 


D.  Stott  Parker 


Judea  Pearl,  Committee  Chair 


University  of  California,  Los  Angeles 

1997 




Table of Contents

1 Causality
1.1 Introduction
1.2 Previous Work
1.2.1 Lewis’s Counterfactuals
1.2.2 Iwasaki and Simon’s Causality in Device Behavior
1.2.3 Balke’s Probabilistic Counterfactuals
1.2.4 Neyman-Rubin’s Counterfactuals
1.3 Contributions
1.4 Overview

2 Causal Models
2.1 Introduction
2.2 Equation Set
2.3 Valid Causal Models
2.4 Interventions
2.5 Probabilistic Causal Models
2.6 Examples
2.6.1 Sprinkler Example
2.6.2 Policy Analysis in Linear Econometric Models
2.6.3 Linguistic Notions of Causality
2.7 Properties of Counterfactual Statements
2.7.1 Definitions of Causal Properties
2.7.2 Soundness of Composition, Effectiveness, and Reversibility
2.7.3 Independence of Effectiveness, Composition, and Reversibility
2.7.4 Completeness of Causal Properties
2.8 Comparison to Lewis’s Formalism
2.9 Applying Counterfactual Derivation: Example
2.10 Conclusion

3 Dynamic Causal Models
3.1 Introduction
3.2 Causal Models with Memory
3.3 Time-Series Causal Model
3.4 Conclusion

4 Causal Relevance
4.1 Introduction
4.2 Probabilistic Causal Irrelevance
4.2.1 Comparison to Informational Relevance
4.2.2 Axioms of Probabilistic Causal Irrelevance
4.3 Proofs of Axioms of Probabilistic Causal Irrelevance
4.3.1 Counterexample to Property 2.2.2
4.3.2 Numeric Constraints
4.3.3 Axioms of Causal Relevance for Stable Models
4.4 Deterministic Causal Relevance
4.4.1 Axioms of Causal Irrelevance
4.4.2 Proofs of Causal Irrelevance Axioms
4.4.3 Why Transitivity Fails in Causal Relevance
4.4.4 Causal Relevance and Directed Graphs
4.5 Applications of Deterministic Causal Relevance
4.6 Conclusion

5 Identifying Causal Effects
5.1 Introduction
5.2 Identifiability in Econometrics
5.3 Identification in Causal Models
5.4 Notation and Definitions
5.5 Action Calculus
5.6 A Graphical Criterion for Testing Identifiability
5.7 Remarks on Efficiency
5.8 Complexity Analysis
5.9 Deriving a Closed-Form Expression for Control Queries
5.10 Conclusion

6 Conclusion

A Counterexamples

References


List of Figures

1.1 Graphical representation of $A \mathrel{\Box\!\!\rightarrow} B$ in Lewis’s formalism

2.1 An equation set that is not a valid causal model

2.2 A valid non-recursive causal model

2.3 Causal graph illustrating causal relationships among five variables

2.4 Causal graph illustrating the relationship between supply and demand

2.5 Example of the failure of reversibility in Lewis’s framework: $W = w$ holds in all
closest $y$-worlds, and $Y = y$ holds in all closest $w$-worlds, yet $Y \neq y$ and $W \neq w$

2.6 Causal graph illustrating the effect of smoking on lung cancer

3.1 Example of a causal model with memory

3.2 We can ensure that a pa ^ = Xi in a causal model with memory
by combining variables, in the extreme case, to a single variable

3.3 A causal model with memory which contains variables whose values
depend upon the past history of other variables

3.4 A causal model with memory which contains variables whose values
do not depend upon the past histories of other variables

4.1 The graphoid axioms

4.2 Example of $P(y) > \max_x P(y \mid x)$

4.3 Counterexample to property 2.2.2

4.4 Sound and complete axioms for path-interception in directed graphs

4.5 Example of a causal model that requires the examination of submodels
before causal relevance can be determined

4.6 Counterexample to transitivity in causal irrelevance

4.7 Transitivity fails, even when a variable is more completely controlled
by its parents

5.1 Illustrating Condition 3 of Theorem 12. In (a), the set $\{B_1, B_2\}$
blocks all back-door paths from $X$ to $Y$ and $P(b_1, b_2 \mid \hat{x}) = P(b_1, b_2)$.
In (b), the node $B$ blocks all back-door paths from $X$ to $Y$, and
$P(b \mid \hat{x})$ is identifiable using Condition 4

5.2 Illustrating Condition 4 of Theorem 12. (a) $Z_1$ blocks all directed
paths from $X$ to $Y$, and the empty set blocks all back-door paths
from $Z_1$ to $Y$ in $G_{\overline{X}}$ and all back-door paths from $X$ to $Z_1$ in $G$.
(b,c) $Z_1$ blocks all directed paths from $X$ to $Y$, and $Z_2$ blocks all
back-door paths from $Z_1$ to $Y$ in $G_{\overline{X}}$ and all back-door paths from
$X$ to $Z_1$ in $G$

5.3 Using Rule 2 to remove the hat from $X$ when the criterion fails:
since $Z$ is necessary, there must be a directed path from (a) $Z$ to
$Y$ or (b) $Z$ to $X$

5.4 Theorem 12 ensures a reducing sequence for $P(y_2 \mid \hat{x}, y_1)$ and
$P(y_1 \mid \hat{x})$, although none exists for $P(y_1 \mid \hat{x}, y_2)$

5.5 If a member of $K$ blocks a back-door path from $X$ to $Y$ and is a
descendant of $X$, then it is also an ancestor of $Y$

5.6 Examples of the two cases for $K'$

5.7 There must exist a member $B'$ of $B$ which blocks the back-door
path from X to J

5.8 $B$ can be between either (a) $X$ and $K'$ or (b) $K'$ and $Y$

A.1 Counterexample to property 2.2.3

A.2 Counterexample to property 2.2.4

A.3 Counterexample to property 2.3

A.4 Counterexample to property 2.4

A.5 Counterexample to property 2.5.1

A.6 Counterexample to 2.5.1, such that each variable in $U$ has a single child

A.7 Counterexample to property 2.6


List of Tables

2.1 Table of counterfactual statements where composition and effectiveness
hold but reversibility does not

2.2 Table of counterfactual statements where effectiveness and reversibility
hold but composition does not

2.3 Table of counterfactual statements where composition and reversibility
hold but effectiveness does not


Acknowledgments 


I  would  like  to  thank  my  advisor,  Judea  Pearl.  I  was  blessed  to  be  able  to  work 
with  a  researcher  of  his  caliber.  The  members  of  my  committee,  D.  Stott  Parker, 
Sheila  Greibach,  and  Sander  Greenland,  deserve  thanks  for  valuable  insights  that 
added  greatly  to  the  quality  of  this  document.  I  would  also  like  to  thank  Kaoru 
Mulvihill  for  frequently  coming  to  my  aid  as  deadlines  loomed.  Along  with  every 
graduate  of  the  UCLA  Computer  Science  Department,  I  am  indebted  to  Verra 
Morgan  for  ensuring  that  I  jumped  through  all  of  the  requisite  hoops  along  the 
way  to  graduation.  Finally,  my  deepest  gratitude  goes  to  my  wife,  Julie.  Without 
her  constant  support  and  encouragement,  I  would  never  have  finished  this  degree. 




Vita

1969        Born, Walnut Creek, California.

1989-1992   Section Leader, Computer Science Department, Stanford University,
            Palo Alto, California.

1992        B.S. (Computer Science), with Distinction, Stanford University,
            Palo Alto, California.

1993-1995   Teaching Assistant, Computer Science Department, University
            of California, Los Angeles.

1994        M.S. (Computer Science), University of California, Los Angeles.

1995-1997   Research Assistant, Computer Science Department, University
            of California, Los Angeles.

1995-1997   Instructor, Loyola Marymount site, Center for Talented Youth,
            Johns Hopkins University.


Abstract  of  the  Dissertation 


Structural  Causal  Models: 

A  Formalism  for  Reasoning  About  Actions  and 

Counterfactuals 

by 

David  Jerome  Galles 

Doctor  of  Philosophy  in  Computer  Science 
University  of  California,  Los  Angeles,  1997 
Professor  Judea  Pearl,  Chair 

Our everyday language is permeated with causal utterances. In-depth
understanding of an event or action is associated with comprehension of the causal
mechanisms that led to the event or governed the action. This dissertation gives
formal underpinnings to some of the intuitive notions of causation, in the
language of structural causal models. This formalism, which is tailored after the
structural equation model of engineering and economics, provides a language for
specifying the precise meaning of concepts such as influence, causal relevance,
counterfactuals, probability of causes and probability of effects. We model
actions as local modifications of causal models, that is, the replacement of a set of
equations with other equations.
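
The idea of an action as the replacement of equations can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the dictionary encoding, the helper names `solve` and `intervene`, and the three toy equations (loosely inspired by a sprinkler setting) are hypothetical and are not taken from the dissertation's formal definitions.

```python
from typing import Callable, Dict, List

# Hypothetical encoding: each variable maps to an equation that computes its
# value from the values of variables evaluated before it.
Equation = Callable[[Dict[str, int]], int]


def solve(model: Dict[str, Equation], order: List[str]) -> Dict[str, int]:
    """Evaluate a recursive (feedback-free) model in the given causal order."""
    values: Dict[str, int] = {}
    for name in order:
        values[name] = model[name](values)
    return values


def intervene(model: Dict[str, Equation], **fixed: int) -> Dict[str, Equation]:
    """Return a copy of the model with each named equation replaced by a constant."""
    new_model = dict(model)  # the original model is left untouched
    for name, value in fixed.items():
        new_model[name] = lambda _values, v=value: v
    return new_model


ORDER = ["rain", "sprinkler", "wet"]
model: Dict[str, Equation] = {
    "rain": lambda v: 1,                              # assumed circumstance: it rains
    "sprinkler": lambda v: 1 - v["rain"],             # sprinkler runs only when dry
    "wet": lambda v: max(v["rain"], v["sprinkler"]),  # either source wets the grass
}

observed = solve(model, ORDER)
acted = solve(intervene(model, sprinkler=1), ORDER)
print(observed)  # values under the unmodified equations
print(acted)     # values after replacing the sprinkler equation
```

In this toy model the sprinkler is on exactly when it does not rain, so observing the sprinkler on would indicate no rain. Forcing the sprinkler on replaces only its own equation, and the equation for rain is untouched. This locality is what the phrase "local modifications" refers to.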

We sharpen our understanding of causality by developing an axiomatization for
causal relevance in causal models. We examine two versions of conditional causal
irrelevance: probabilistic, “If we hold Z constant, changing X cannot change the
probability of Y” and deterministic, “If we hold Z constant, changing X cannot
alter the value of Y in any circumstance.” Comparison of the behavior of these
two types of relevance to the behavior of conditional independence reveals some
of the fundamental differences between causation and correlation.

Finally,  we  demonstrate  how  causal  models  can  be  used  to  calculate  quantities 
of  interest.  We  provide  a  polynomial-time  algorithm  that  decides,  given  the 
graph  associated  with  a  causal  model,  whether  the  causal  effect  of  one  variable 
on  another  can  be  determined  from  data  obtained  under  controlled  conditions. 
Whenever such a determination is feasible, the algorithm yields a closed-form
expression for the causal effect.




CHAPTER  1 


Causality 


1.1  Introduction 

Causation  is  ubiquitous  in  everyday  language.  For  instance,  when  asked  why 
there  is  a  stain  on  the  carpet,  one  might  reply  “I  knocked  over  my  coffee  mug.” 
This  is  a  causal  explanation  of  the  event — it  implies  a  chain  of  events,  each  of 
which  is  the  direct  cause  of  the  next:  knocking  over  the  mug  caused  the  coffee  in 
the  mug  to  spill,  and  the  coffee  that  spilled  on  the  floor  stained  the  carpet.  On 
any  given  day,  we  make  dozens  of  such  explanations. 

Given the prevalence of causality and causal language in everyday life,
we would expect scientific discourse to be laced with causal concepts. However,
when we look at statistics, which is the formal language commonly used by
scientists, we find a notable absence of causal concepts. The same is true for standard
first-order logic, where there is no distinction between causation and implication.
Why is causality, which we find so useful in everyday life, missing from the formal
languages of statistics and logic?

One reason is the lack of rigorous, mathematical, and useful definitions of causal
terms. This dissertation seeks to offer rigorous definitions of causal terms such as
influence, causal irrelevance, and direct effect. Of course, on their own, definitions
are of little use. The point of crafting these definitions, and ensuring that they have
a rigorous mathematical basis, is to manipulate them. Giving causal language
precise meaning will allow us to develop a calculus for answering queries about
causation and to create useful tools for policy analysis as well as world modeling.

The formal language that we will use to obtain a specification of causality
is that of structural causal models, which will be crafted and interpreted after
the structural equation models used in engineering and statistics. Throughout this
thesis, we will use the abbreviations structural model and causal model to
emphasize different aspects of structural causal models. Causal models offer a succinct
language for discussing the effects of actions in the world. As such, they are a
useful tool for modeling interventions in a wide range of applications. After
developing the theoretical foundations of causal models, this dissertation will provide
a mathematical characterization of causal relevance that is akin to the work in
graphoids on observational relevance. Various useful extensions to causal models
that allow for time-varying as well as steady-state systems will be investigated.
Finally, practical applications of causal models will be demonstrated.

1.2  Previous  Work 

When developing a formalism for describing and reasoning about causality,
researchers often present the concept of causality within the framework of causal
counterfactuals. A counterfactual statement takes the form, “If A were true, then
B would be true as well.” Much of the work in evaluating these counterfactuals
involves considering the minimal change needed to make A true and testing
whether B is then true as well. The tricky part is defining what exactly is meant
by “minimal change.”




1.2.1  Lewis’s  Counterfactuals 


Lewis approaches the concept of a minimal change through the concept of a
closest world [Lew73a]. Lewis defines the counterfactual statement “If it were
the case that A, then it would be the case that B,” written
$A \mathrel{\Box\!\!\rightarrow} B$, as true at world i iff some accessible
AB-world is closer to i than any $A \neg B$-world, if there are
any A-worlds accessible from i. Thus, given any world i, we can imagine spheres
of similarity around i. A set of worlds S is a sphere around i iff every S-world is
accessible from i and is closer to i than any world not in S. We call a sphere S
A-permitting if there exists some world in S such that A holds. Thus,
$A \mathrel{\Box\!\!\rightarrow} B$ holds at i iff there exists some A-permitting
sphere S such that $A \rightarrow B$ holds in every member of S. This is shown
graphically in Figure 1.1.


Figure 1.1: Graphical representation of $A \mathrel{\Box\!\!\rightarrow} B$ in Lewis’s formalism
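
As a minimal sketch of the truth condition above, assume a finite set of worlds, all accessible, and a numeric distance from the actual world i. Lewis deliberately leaves the distance unspecified, so the function name `would`, the world encoding, and the distances are illustrative assumptions rather than part of his formalism.

```python
def would(worlds, dist, a, b):
    """Evaluate the counterfactual 'if A were true, B would be true'.

    worlds: list of dicts mapping proposition names to booleans.
    dist:   hypothetical distances from the actual world, same order as worlds.
    a, b:   names of the antecedent and consequent propositions.
    """
    a_worlds = [(d, w) for d, w in zip(dist, worlds) if w[a]]
    if not a_worlds:
        return True  # vacuously true: there is no accessible A-world
    ab = [d for d, w in a_worlds if w[b]]           # A-and-B worlds
    a_not_b = [d for d, w in a_worlds if not w[b]]  # A-and-not-B worlds
    if not ab:
        return False
    # Some AB-world must be strictly closer than every A-and-not-B world.
    return not a_not_b or min(ab) < min(a_not_b)


worlds = [
    {"A": False, "B": False},  # the actual world
    {"A": True, "B": True},
    {"A": True, "B": False},
]
print(would(worlds, [0, 1, 2], "A", "B"))  # AB-world is closest
print(would(worlds, [0, 2, 1], "A", "B"))  # A-and-not-B world is closest
```

Changing only the distances flips the verdict, which is why the choice of distance measure carries so much weight in this formalism.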

To completely define a system using this formalism, we need to give
definitions for accessibility and distance. On accessibility, Lewis keeps an open mind.
He cannot imagine a convincing case showing the need for inaccessibility but
leaves accessibility in the formalism, arguing that accessibility restrictions can be
dropped by making all worlds equally accessible. He purposely leaves out
specification of possible distance measures in an effort to make the formalism as general
as possible. He argues that because counterfactuals themselves are imprecise, it
makes more sense to be precise about the link between two imprecise notions
than it does to tie down an imprecise notion with a precise one.

In  essence,  causal  models  as  discussed  in  this  thesis  represent  a  commitment  to 
a specific type of distance metric (i.e., “closest”) in Lewis, one that is amenable to
computer  representation.  While  in  Lewis’s  system,  the  distance  metric  is  taken  as 
primitive,  in  our  system  the  distance  metric  is  derived  from  a  more  fundamental 
notion  of  mechanisms.  Our  use  of  causal  mechanisms  as  a  primitive,  rather 
than  a  more  abstract  distance  measure,  allows  for  easier  manipulation  of  causal 
sentences  by  digital  computers.  We  will  show  that,  in  many  cases,  this  specific 
choice  of  the  distance  metric  does  not  affect  the  set  of  counterfactual  statements 
that  we  consider  valid. 

1.2.2  Iwasaki  and  Simon’s  Causality  in  Device  Behavior 

Open  any  physics  textbook  and  you  will  find  the  physical  world  described  with 
systems  of  simultaneous  equations.  Thus,  such  systems  of  equations  would  seem 
to  be  a  natural  vehicle  for  expressing  causation.  Iwasaki  and  Simon  [IS86]  have 
devised a method that imparts a causal flavor to systems of equations. From a set
of equations, they infer a causal ordering that tells which variables causally affect
other variables in the system. Much of their work revolves around determining
what the causal order should be, given a set of equations, each describing a
physical law.

In  contrast,  we  take  causal  ordering  as  given  and  explore  the  utilization  of  the 
ordered  system  as  a  means  for  interpreting  causal  expressions,  and  predicting  the 
effect  of  actions.  Additionally,  we  incorporate  uncertainty  into  the  model  which 
enables  us  to  work  with  partially  specified  systems. 

1.2.3 Balke’s Probabilistic Counterfactuals

Balke [Bal95] investigates the relation between causal counterfactuals and causal
models, and utilizes causal models to compute causal counterfactual queries.
In this formalism, causal relations between variables are mediated by
response-function variables.

Consider, for instance, a simple system of two binary variables, A and B,
such that A has some causal influence on B, and B has no causal influence on
A. There are four possible deterministic functions between A and B: B always
has the value $b_0$ (regardless of the value of A); B has the value $b_0$ if A has the
value $a_0$, and the value $b_1$ otherwise; B has the value $b_1$ if A has the value $a_0$,
and the value $b_0$ otherwise; B always has the value $b_1$ (regardless of the value of A).
We can imagine a response variable $r_b$ that dictates which of these deterministic
functions describes the causal effect of A on B. Thus, the value of B becomes
a deterministic function of A and the response-function variable: $b = f_b(a, r_b)$.
Since there are four possible functional mappings from A to B, the
response-function variable for B has a four-valued domain, $r_b \in \{0, 1, 2, 3\}$. The complete
functional specification for B is

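$$b = f_b(a, r_b) = h_{b,r_b}(a) \qquad (1.1)$$

where each response function $h_{b,r_b}$ can be specified as follows:

$$h_{b,0}(a) = b_0 \qquad (1.2)$$

$$h_{b,1}(a) = \begin{cases} b_0 & \text{if } a = a_0 \\ b_1 & \text{if } a = a_1 \end{cases} \qquad (1.3)$$

$$h_{b,2}(a) = \begin{cases} b_1 & \text{if } a = a_0 \\ b_0 & \text{if } a = a_1 \end{cases} \qquad (1.4)$$

$$h_{b,3}(a) = b_1 \qquad (1.5)$$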

A probability distribution over $r_b$ completes the parameterization of the model.
Heckerman and Shachter [HS94] used a similar model in their analysis of
counterfactuals.

In Balke’s approach, answering counterfactual queries such as “Given that we
observe the set of variables C to have the value c, what is the probability that
B would be equal to b, if A were a?” is done in the following fashion: (1) start
with some distribution over the response-function variables; (2) use observations
of variables in C to obtain updated values for the response variables via the
Bayesian updating techniques of [Pea88]; (3) modify the response function for
A so that A always has the value a; (4) using the updated response functions,
calculate a probability distribution for B.
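
The four steps can be sketched on the two-variable system above. This is an illustrative sketch, not Balke’s implementation: values are encoded as 0 and 1, A is assumed to be independent of the response variable for B, the evidence is taken to be the observed values of A and B themselves, and the priors in the usage lines are made-up numbers.

```python
from itertools import product

# Response functions h_{b,r}(a) of Eqs. (1.2)-(1.5); a_0, b_0 -> 0 and a_1, b_1 -> 1.
H_B = {
    0: lambda a: 0,      # B is always b_0
    1: lambda a: a,      # B = b_0 if A = a_0, b_1 otherwise
    2: lambda a: 1 - a,  # B = b_1 if A = a_0, b_0 otherwise
    3: lambda a: 1,      # B is always b_1
}


def counterfactual(prior_a, prior_rb, obs_a, obs_b, do_a):
    """Return P(B would be b_1 had A been do_a | A = obs_a, B = obs_b)."""
    # Steps 1-2: start from the prior and condition on the observations.
    posterior = {}
    for a, r in product(prior_a, prior_rb):
        if a == obs_a and H_B[r](a) == obs_b:
            posterior[(a, r)] = prior_a[a] * prior_rb[r]
    z = sum(posterior.values())
    if z == 0:
        raise ValueError("evidence has zero probability under the prior")
    # Steps 3-4: force A = do_a and propagate the updated response variable.
    return sum(p for (_, r), p in posterior.items() if H_B[r](do_a) == 1) / z


# Hypothetical priors, chosen only for illustration.
prior_a = {0: 0.5, 1: 0.5}
prior_rb = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
print(counterfactual(prior_a, prior_rb, obs_a=0, obs_b=0, do_a=1))
```

With these priors, observing A = a_0 and B = b_0 leaves only response functions 0 and 1 possible, and only the second of them yields b_1 once A is forced to a_1.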

1.2.4 Neyman-Rubin’s Counterfactuals

Balke’s response functions, as well as the counterfactual interpretation given in
this dissertation, have close connections to the notion of “unit potential response,”
which has been used in statistics [Ney23, Rub74] as the basis for analyzing the
effect of treatments on populations. The unit potential response, denoted $Y_x(u)$,
stands for the counterfactual sentence “The value that Y would take in person u
had X been x,” where X stands for a type of treatment that a person can receive.
The essential difference between the Neyman-Rubin framework and the one explored
in this dissertation is that the former takes $Y_x(u)$ as primitive while we treat it as
a derived quantity, computed from the fundamentals of the processes responsible
for Y taking on the value $Y_x(u)$ as X changes to x. We treat u not as merely the
index of an individual but, rather, as the set of attributes u that characterize the
individual, the experimental conditions under study, and so on. In fact, every
structural causal model can be translated into a set of counterfactual statements
of the type used in the statistical literature [Pea95a, p. 703], and conversely,
every counterfactual sentence can be treated as a constraint over the behavior of
a structural causal model. Using our process-based semantics, however, uncovers
properties of $Y_x(u)$ that were not formalized in the statistical literature.

1.3  Contributions 

This  dissertation  gives  a  formal  underpinning  to  some  of  the  intuitive  notions 
of  causation  by  using  the  language  of  structural  models  to  specify  the  precise 
meaning of concepts such as causal relevance and causal influence. The principal
contributions of this dissertation consist of:

•  Construction  of  a  set  of  axioms  for  causal  counterfactuals  that  is  sound  and 
complete  relative  to  the  formal  interpretation  of  action  and  change. 

•  Implementation  of  notions  of  past  state  and  time  within  the  framework  of 
causal  models. 

• Formal mathematical specification of two types of causal irrelevance:
probabilistic and deterministic.

• Construction of a set of axioms for causal irrelevance that is similar to the
set of graphoid axioms for informational irrelevance.

•  A  method  of  using  graphs  as  theorem  provers  to  validate  properties  of  causal 
irrelevance. 

•  A  polynomial-time  algorithm  for  deciding  the  identifiability  of  control 
queries  given  the  underlying  graph  of  a  causal  model. 

1.4 Overview

In Chapter 2, we give a formal description of causal models, along with
examples and motivations. We also prove some properties of causal models, compare
our formalism to Lewis’s, and translate linguistic notions of causality into the
language of structural models. In Chapter 3, we explore ways of adding the
concept of memory to causal models. Our extensions allow for consideration of past
state and time within this formalism. Chapter 4 sharpens our understanding of
causality within the framework of structural models. We provide probabilistic
and deterministic definitions of causal relevance and develop some axioms over
each type of irrelevance. In Chapter 5, we demonstrate how causal models can
be used to calculate quantities of interest. We take as our example the problem
of identifiability and, ultimately, present a polynomial-time algorithm that
determines the identifiability of a control query by utilizing the graph of a causal
model. We summarize and make concluding remarks in Chapter 6.




CHAPTER  2 


Causal  Models 


2.1  Introduction 

This chapter gives a formal definition of causal models, which are based on
structural equation models in engineering and economics [Haa43, Gol73, Sim70].
Unlike structural equation models, which assume linear interactions, causal models
allow for each variable to be an arbitrary function of other variables in the model.
Thus, any possible steady-state interaction can be modeled using a causal model.
We build our definition of causal models in two stages. First, we define an
equation set. We then refine the definition of an equation set to define a causal model
(see [Pea95a]). Throughout this dissertation, we will use the terms causal
model and structural model interchangeably.


2.2  Equation  Set 

An equation set is a representation of the causal mechanisms that govern a set of
variables. The variable set is broken into two categories, exogenous variables and
endogenous variables. Each endogenous variable receives its value from a
deterministic causal mechanism. This causal mechanism is represented by a
deterministic function. Thus, for each endogenous variable X, there will be a function $f_X$
that completely determines the value of X, given all the other variables in the
system. Formally, an equation set is defined as follows:


Definition 1 (equation set) An equation set is a 3-tuple

$$M = \langle V, U, F \rangle$$

where

(i) $V = \{X_1, \ldots, X_n\}$ is a set of endogenous variables determined within the
system,

(ii) $U = \{U_1, \ldots, U_m\}$ is a set of exogenous variables that represent disturbances,
abnormalities, assumptions, or boundary conditions, and

(iii) $F = \{f_{X_i}\}$ is a set of n deterministic, nontrivial functions, each of the form

$$x_i = f_{X_i}(pa_i, u), \qquad i = 1, \ldots, n \qquad (2.1)$$

where $pa_i$ are the values of a set of variables $PA_i \subseteq V \setminus X_i$ (connoting parents),
called the direct causes of $X_i$.

Each equation $x_i = f_{X_i}$ has a privileged variable $X_i$, which appears only once
in the equation, as the only variable to the left of the equals sign.

It is important to note that each equation $x_i = f_{X_i}$ has a privileged variable
$X_i$. The function $f_{X_i}$ represents the causal mechanism that determines the value
of $X_i$. Thus, if we take an equation set that contains the variables $X_i$ and $X_j$, and
replace the equation $f_{X_i}$ with a linear combination of the equations $f_{X_i}$ and $f_{X_j}$,
we would no longer have an equation set, because each equation would no longer
represent the mechanism that affects the value of a single privileged variable.
Thus, not every group of equations is an equation set.




The restriction that the equations $f_{X_i}$ be nontrivial functions of the parents
$PA_i$ means that for each variable $X_n$ in $PA_i$, there exist two values $x_n$ and $x_n'$
such that

$$f_{X_i}(pa_i', x_n) \neq f_{X_i}(pa_i', x_n')$$

where $pa_i'$ is a set of values for $PA_i \setminus X_n$.

Thus, $f_{X_1}(x_2, x_3) = x_2$ is a nontrivial function of $X_2$ but a trivial function of
$X_3$. This restriction does not so much prohibit which variables can be parents as
define what a parent is.
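
For finite domains, the nontriviality condition can be checked by exhaustive search. The sketch below is an illustration under that assumption; the helper name `nontrivial_args` and the representation of a mechanism as a Python function of positional arguments are ours.

```python
from itertools import product


def nontrivial_args(f, domains):
    """Return the indices of the arguments on which f depends nontrivially.

    f:       function taking len(domains) positional arguments.
    domains: list of finite value lists, one per argument.
    """
    result = []
    for k, dom_k in enumerate(domains):
        others = domains[:k] + domains[k + 1:]
        for rest in product(*others):
            # Hold the other arguments fixed and vary argument k only.
            outputs = {f(*rest[:k], v, *rest[k:]) for v in dom_k}
            if len(outputs) > 1:
                result.append(k)  # two values of argument k give different outputs
                break
    return result


# The example from the text: f(x2, x3) = x2 depends on x2 but not on x3.
print(nontrivial_args(lambda x2, x3: x2, [[0, 1], [0, 1]]))
```

Only the arguments returned by such a check qualify as parents in the sense of Definition 1.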

2.3  Valid  Causal  Models 

Now that we have a definition of an equation set, we can use it to define a causal
model. A causal model is an equation set that meets two restrictions on the
equations $f_{X_i}$.

Definition 2 (causal model) A causal model is an equation set that meets the
following two restrictions:

(i) The set of equations $\{f_{X_i}\}$ has a unique solution for $X_1, \ldots, X_n$ given any
value of the disturbances $U_1, \ldots, U_m$.

(ii) If we replace any subset of the set of equations $\{f_{X_i}\}$ with constant functions
$f_{X_j} = c$ (where c is any constant), the remaining equations will also have a
unique solution for any value of the disturbances $U_1, \ldots, U_m$.

Examples  of  causal  models  can  be  found  in  Section  2.6. 

The uniqueness assumption is equivalent to the requirement that F
represents a deterministic physical system in equilibrium. Assuming that all relevant
boundary conditions U are accounted for, such a system can only be in one state.

The assumption that there is a unique solution for $X_1, \ldots, X_n$ imposes some
restrictions on the functions $f_{X_i}$. At first, this assumption seems overly restrictive.
Why restrict the equation set to having a unique fixed point for all possible values
of the exogenous variables? Doesn’t this needlessly restrict what we can express
using causal models? Upon further inspection, however, this requirement is not
as restrictive as it first appears. Consider an equation set with two fixed points
for a particular value of U. If we consider the exogenous variables to be the world
state as defined in physics, we have a world state that includes a variable whose
value is not determined. Thus, we have not completely specified the world state
and do not have a complete model until we add some more information to the
world state. If we wish to describe a world state in which the value of a variable
is only stochastically determined, we need to add to U an additional variable $U_{m+1}$
that governs the stochastic interaction.

Consider, for instance, the equation set in Figure 2.1. Because the state $U_1 = 0$
permits two possible solutions for X and Y, namely (X = 1, Y = 1) and (X = 0, Y = 0),
this equation set is not considered a causal model. Indeed, this equation set
should not be a causal model, since it is under-defined. If $U_1$ has the value 0,
then what should the value of X be? Should it rely on some kind of stochastic
process? If so, we need to add another variable $U_2$ to model that process. Should
this equation set model a type of flip-flop, such that X and Y have the value 1
if $U_1$ has the value 1, but retain their past values if $U_1$ has the value 0? If our
intent is to describe a system that employs a notion of past states, then we need
to define exactly what we mean by a “previous state” and formally define how
such a previous state influences the present state. In Section 3.2, we offer such
a formalism, which we could use to precisely describe systems, such as flip-flops,
that utilize the concepts of past history and previous state.
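
The multiplicity of solutions can be made concrete by enumerating fixed points. Figure 2.1 is not reproduced in this text, so the pair of equations below (X = Y or U1, Y = X) is a hypothetical stand-in chosen only because it exhibits the behavior just described; the helper `solutions` is likewise illustrative.

```python
from itertools import product


def solutions(equations, u):
    """List every binary assignment that satisfies all equations for a given u."""
    names = list(equations)
    found = []
    for values in product([0, 1], repeat=len(names)):
        v = dict(zip(names, values))
        if all(v[n] == equations[n](v, u) for n in names):
            found.append(v)
    return found


# Hypothetical equation set with flip-flop-like behavior (an assumption,
# not the contents of Figure 2.1).
flip_flop = {
    "X": lambda v, u: v["Y"] | u["U1"],
    "Y": lambda v, u: v["X"],
}

print(solutions(flip_flop, {"U1": 0}))  # two fixed points: under-defined
print(solutions(flip_flop, {"U1": 1}))  # a single fixed point
```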


It  is  straightforward  to  show  that  every  equation  set  that  represents  a  recursive 
set  of  equations  will  necessarily  be  a  causal  model.  The  variables  in  a  recursive 
equation  set  can  be  listed  in  order  such  that  each  variable  is  listed  before  any  of 
its  descendants.  Since  each  variable  is  a  deterministic  function  of  its  parents,  each 
variable  in  the  list  is  thus  uniquely  determined  as  long  as  all  previous  variables 
on  the  list  are  uniquely  determined.  By  strong  induction,  since  the  first  element 
in  the  list  is  uniquely  determined  by  the  variables  U,  all  elements  in  the  list  must 
be  uniquely  determined  by  U. 
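
The induction argument corresponds to a single pass over the variables in causal order. The following is a minimal sketch under the assumption that such an order is supplied; the function name `solve_recursive` and the example mechanisms are illustrative.

```python
def solve_recursive(order, parents, functions, u):
    """Solve a recursive equation set in one pass.

    order:     endogenous variable names, each listed after all of its parents.
    parents:   dict mapping a name to the list of its parents' names.
    functions: dict mapping a name to f(pa, u), where pa maps parent names to values.
    u:         dict of exogenous values.
    """
    values = {}
    for name in order:
        missing = [p for p in parents[name] if p not in values]
        if missing:
            raise ValueError(f"{name} is listed before its parents {missing}")
        pa = {p: values[p] for p in parents[name]}
        # Every parent is already determined, so this value is determined too.
        values[name] = functions[name](pa, u)
    return values


# Hypothetical two-variable chain: X1 = U1, X2 = X1 and U2.
parents = {"X1": [], "X2": ["X1"]}
functions = {
    "X1": lambda pa, u: u["U1"],
    "X2": lambda pa, u: pa["X1"] & u["U2"],
}
print(solve_recursive(["X1", "X2"], parents, functions, {"U1": 1, "U2": 1}))
```

Because each variable is assigned exactly once and never revisited, the solution is unique for every value of U.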

Since  only  nonrecursive  equation  sets  can  be  invalid  causal  models,  it  is  useful 
to  consider  whether  any  nonrecursive  equation  set  that  is  a  valid  causal  model 
exists.  There  are,  in  fact,  many  nonrecursive  equation  sets  that  are  also  valid 
causal  models.  For  example,  consider  the  equation  set  in  Figure  2.2. 

Figure 2.2: A valid non-recursive causal model

This equation set is a valid causal model. The model, which we denote as M,
dictates unique values for X and Y for both values of $U_1$: When $U_1 = 0$, X has
the value 1 and Y has the value 0; X(0) = 1 and Y(0) = 0. When $U_1 = 1$, X
has the value 1 and Y has the value 1; X(1) = 1 and Y(1) = 1. In addition, the
submodels of M also dictate unique solutions. There is a unique value for Y (for
both values of $U_1$) in $M_{X=0}$ and $M_{X=1}$: $Y_{X=0}(0) = 0$, $Y_{X=0}(1) = 0$, $Y_{X=1}(0) = 0$,
$Y_{X=1}(1) = 1$. There is also a unique value for X (for both values of $U_1$) in $M_{Y=0}$
and $M_{Y=1}$: $X_{Y=0}(0) = 1$, $X_{Y=0}(1) = 1$, $X_{Y=1}(0) = 0$, $X_{Y=1}(1) = 1$.
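
Definition 2 can be checked by brute force for small binary systems. In the sketch below, the two equations are inferred from the potential responses listed above (X = not Y, or U1; Y = X and U1), since the figure itself is not reproduced here; they should be read as an assumption, and the helper names are ours.

```python
from itertools import combinations, product


def n_solutions(equations, u):
    """Count the binary assignments that satisfy every equation for a given u."""
    names = list(equations)
    count = 0
    for values in product([0, 1], repeat=len(names)):
        v = dict(zip(names, values))
        count += all(v[n] == equations[n](v, u) for n in names)
    return count


def is_causal_model(equations, u_names):
    """Check both restrictions of Definition 2 for binary variables."""
    names = list(equations)
    for k in range(len(names) + 1):  # k = 0 covers the original model
        for subset in combinations(names, k):
            for consts in product([0, 1], repeat=k):
                sub = dict(equations)
                for n, c in zip(subset, consts):
                    sub[n] = lambda v, u, c=c: c  # replace f_n with a constant
                for u_vals in product([0, 1], repeat=len(u_names)):
                    if n_solutions(sub, dict(zip(u_names, u_vals))) != 1:
                        return False
    return True


# Equations inferred from the listed potential responses (an assumption).
fig_2_2 = {
    "X": lambda v, u: (1 - v["Y"]) | u["U1"],
    "Y": lambda v, u: v["X"] & u["U1"],
}
print(is_causal_model(fig_2_2, ["U1"]))
```

The search is exponential in the number of variables, so it is useful only as a check on small examples such as this one.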

We can see that some nonrecursive equation sets are valid causal models. Does
it make sense to say that these nonrecursive causal models actually model
real-world phenomena? In some cases, yes, it does make sense. For instance, consider
the common economic model of how supply and demand determine the price
of a commodity. This is a nonrecursive causal model that models a real-world
phenomenon. The transient values are usually considered to be of no importance,
so they are ignored. If the desired result is to model how the final price is
obtained, a more complicated model is required. Such models are explored in
Section 3.3.

Finally, given that all linear equation sets are valid causal models, it is useful
to consider just which nonlinear equation sets are also valid causal models. A
nonrecursive equation set that is created by randomly selecting functions over
binary variables will not often be a valid causal model. Such random equation
sets do not model any real-world phenomena, and as such are not of interest to
us, and do not enter into our discussion of causal models.

The reason why we restrict every subset as well as the original set of the
equations F to having a single fixed point will become clear as we describe
interventions on causal models.

2.4  Interventions 

Definition 2 merely provides a description of the mathematical objects that enter
into a causal model. To fulfill our requirement that a causal model be capable
of computing answers to causal queries, we need to supplement Definition 2 with
an interpretation of the sentence “X = x causes Y = y.” In ordinary discourse,
such a sentence implies that we can bring about the condition Y = y by locally
enforcing the condition X = x. Thus, Definition 2 must be supplemented with a
formal interpretation of the notion “locally enforcing X = x” that is compatible
with its common usage.

External intervention normally implies changing some mechanisms in the
domain. In a logical circuit, for example, the act of enforcing the condition $X_i = 0$
by connecting some intermediate variable $X_i$ to ground amounts to changing the
mechanism that normally determines $X_i$. If $X_i$ is the output of an OR gate, then
after the intervention, $X_i$ would no longer be determined by the OR gate but by
a new mechanism (involving the ground) that clamps $X_i$ to 0 regardless of the
input to the OR gate. In the equational representation, this amounts to replacing
the equation $x_i = f_i(pa_i, u)$ with a new equation, $x_i = 0$, that represents the
grounding of $X_i$.




The  replacement  of  just  one  equation,  not  several,  reflects  the  principle  of 
locality  in  the  common  understanding  of  imperative  sentences  such  as  “Raise 
taxes”  or  “Make  him  laugh.”  When  told  to  clean  his  face,  a  child  does  not  ask  for  a 
razor,  nor  does  he  jump  into  the  swimming  pool.  The  proper  interpretation  of  the 
modal  sentence  “do  p”  corresponds  to  a  minimal  perturbation  of  the  existing  state 
of  affairs,  and  this,  in  the  context  of  Definition  2,  corresponds  to  the  replacement 
of  the  minimal  set  of  equations  necessary  to  make  p  compatible  with  U. 

In general, we will consider concurrent actions of the form do(X = x), where
X involves several variables in V.^1 This leads to the following definitions.

Definition 3 (submodel) Let M be a causal model, X a set of variables in V,
and x a particular realization of X. A submodel $M_x$ of M is the causal model

$$M_x = \langle V, U, F_x \rangle$$

where

$$F_x = \{f_{X_i} : X_i \notin X\} \cup \{X = x\} \qquad (2.2)$$

In words, $F_x$ is formed by deleting from F all functions $f_{X_i}$ corresponding to
members of X and replacing these functions with the set of functions X = x.
Implicit in the definition of submodels is the assumption that $F_x$ possesses a
unique solution for every u.
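
The construction of $F_x$ in Eq. (2.2) is a purely local edit of the function set. A minimal sketch follows, assuming F is stored as a dictionary from variable names to functions; the name `submodel` and the small circuit in the usage lines are illustrative.

```python
def submodel(functions, intervention):
    """Return F_x: F with the mechanisms of the intervened variables replaced.

    functions:    dict mapping each endogenous name to f(values, u).
    intervention: dict mapping each variable in X to its forced value x.
    """
    unknown = set(intervention) - set(functions)
    if unknown:
        raise KeyError(f"not endogenous variables: {sorted(unknown)}")
    f_x = dict(functions)  # all other mechanisms are kept unchanged
    for name, value in intervention.items():
        f_x[name] = lambda values, u, value=value: value  # constant function X = x
    return f_x


# Hypothetical circuit: X1 is an OR gate on the inputs, X2 inverts X1.
F = {
    "X1": lambda values, u: u["U1"] | u["U2"],
    "X2": lambda values, u: 1 - values["X1"],
}
F_x = submodel(F, {"X1": 0})  # ground X1
print(F_x["X1"]({}, {"U1": 1, "U2": 1}))  # clamped regardless of the inputs
```

Only the entry for X1 differs between F and F_x, and F itself is left untouched, so many different interventions can be derived from the same specification.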

Submodels are useful in representing the effects of local actions and changes.
If we interpret each function $f_i$ in F as an independent physical mechanism and
define the action do(X = x) as the minimal change in M required to make X = x
hold true under any u, then $M_x$ represents the model that results from such a
minimal change, since it differs from M by only those mechanisms that determine
the variables in X.

^1 The formalization of conditional actions of the form “do(X = x) if Z = z” is straightforward
[Pea94].

Definition 4 (effect of action) Let M be a causal model, X be a set of variables
in V, and x a particular realization of X. The effect of action do(X = x) on M
is given by the submodel $M_x$.

Definition 5 (potential response) Let Y be a variable in V, and let X be a subset
of V. The potential response of Y to action do(X = x), denoted $Y_x(u)$, is the
solution for Y of the set of equations $F_x$.

Definition 6 (counterfactual) Let Y be a variable in V, and let X be a subset of
V. The counterfactual sentence “The value that Y would have obtained, had X
been x” is interpreted as denoting the potential response $Y_x(u)$.^2

The syntactical transformation described in Definition 5 corresponds to
replacing the old functional mechanisms $x_i = f_i(pa_i, u)$ with new mechanisms
$X_i = x_i$ that represent the external forces that set the values $x_i$ for each $X_i \in X$.
As before, we assume each variable $Y \in V$ to be a unique function of the
background U in any model $M_x$, namely, $Y = Y_{M_x}(u)$. For brevity, we will often omit
the subscript M, leaving $Y_x(u)$.

An explicit translation of intervention into “wiping out” equations in the
causal model was first proposed in [SW60] and used in [Fis70] and [Sob90].
Graphical ramifications are explicated in [SGS93] and [Pea93]. Interpretations of causal
and counterfactual utterances in terms of $Y_x(u)$ are provided in [Pea96a].
Alternative formulations of causality, in terms of event trees, are given in [Rob86b]
and [Sha96].

^2 The connection between counterfactuals and local actions is made in [Lew73a] and is
elaborated in [BP94] and [HS95]. Readers who are disturbed by the impracticality of actions in the
interpretation of some counterfactuals (e.g., “If I were young”) are invited to replace the word
“action” with the word “modification” (see [Lea85]). [Pea95a, p. 706] explains the advantage of
using hypothetical external interventions, rather than spontaneous changes, in thinking about
causation and counterfactuals.

Note that $Y_x(u)$ is well defined even when U = u and X = x are incompatible
in M (i.e., $X(u) \neq x$). Thus, there is room in the model for actions to enforce
propositions that are not realized under normal conditions, or that are not
realized under the abnormalities modeled in U. For example, if M describes a logic
circuit we might wish to intervene and set some voltage X to x, even though the
input dictates $X \neq x$. It is for this reason that one must invoke some notion of
mechanism breakdown or “surgery” in the definition of interventions.

The unique feature of our formulation of actions, the feature that sets it apart
from the formulations in control theory or decision analysis [Sav54, HS95], is
that an action is treated as a modality; namely, it is not given an explicit name
but, rather, acquires the names of the propositions that it enforces as true. This
enables the model to predict the effects of a huge number of action
combinations without the modeler having to attend to such combinations. Instead, the
causal model is constructed by specifying the characteristics of each individual
mechanism under normal conditions, free of intervention.

2.5  Probabilistic  Causal  Models 

If we wish to use causal models to do probabilistic reasoning, we need to add probability to the causal model framework. In probabilistic causal models, disturbances $U$ are described with a probability distribution.




Definition 7 (probabilistic causal model) A probabilistic causal model is a tuple $\langle M, P(u) \rangle$, where

(i) $M$ is a causal model $\langle V, U, F \rangle$, and

(ii) $P(u)$ is a probability distribution over $U$, such that each element $U_i \in U$ is marginally independent of all other elements of $U$.

The restriction on the probability distribution $P(u)$ that all members of $U$ are independent does not limit the expressive power of probabilistic causal models. Any $P(u)$ with dependencies can be modeled by combining elements of $U$.

Given that each endogenous variable in a probabilistic causal model is a function of $U$ and that a probabilistic causal model specifies a probability distribution over $U$, we can define a probability distribution over the endogenous variables in a probabilistic causal model. That is, for every set of variables $Y \subseteq V$, we have

$$P(y) = \sum_{\{u \mid Y(u) = y\}} P(u) \tag{2.3}$$

The probability of counterfactual statements is defined in the same manner, through the function $Y_x(u)$ induced by the submodel $M_x$:

$$P(Y_x = y) = \sum_{\{u \mid Y_x(u) = y\}} P(u) \tag{2.4}$$
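As a minimal sketch of Eqs. (2.3) and (2.4), the sums over $u$ can be carried out by direct enumeration when $U$ is small. The two-variable model below, the probabilities, and the helper names `solve` and `prob` are illustrative assumptions and do not come from the report.

```python
from itertools import product

# Hypothetical toy model: exogenous U = (U1, U2), both binary and independent;
# endogenous X = U1 and Y = X XOR U2.
P_U1 = {0: 0.7, 1: 0.3}
P_U2 = {0: 0.9, 1: 0.1}

def solve(u, do=None):
    """Solve the model for a given u; `do` maps variable names to forced values."""
    do = do or {}
    u1, u2 = u
    x = do.get("X", u1)        # surgery: the equation for X is replaced by X = x
    y = do.get("Y", x ^ u2)
    return {"X": x, "Y": y}

def prob(event, do=None):
    """Sum P(u) over all u whose solution satisfies `event` (Eqs. 2.3 and 2.4)."""
    total = 0.0
    for u in product(P_U1, P_U2):
        if event(solve(u, do)):
            total += P_U1[u[0]] * P_U2[u[1]]
    return total

p_y1 = prob(lambda v: v["Y"] == 1)                      # P(Y = 1)
p_y1_do_x1 = prob(lambda v: v["Y"] == 1, do={"X": 1})   # P(Y_{x=1} = 1)
```

The same `prob` function serves both equations; the only difference is whether the solution is taken in $M$ or in the submodel $M_x$.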

We note that a causal model defines a joint distribution on all counterfactual statements; that is, $P(Y_x = y, Z_w = z)$ is defined for any sets of variables $Y, X, Z, W$, not necessarily disjoint. In particular, $P(Y_x = y, Y_{x'} = y')$ is well defined and is given by $\sum_{\{u \mid Y_x(u) = y \,\&\, Y_{x'}(u) = y'\}} P(u)$. Likewise, $P(Y_x = y, X = x')$ is well defined and is given by $\sum_{\{u \mid Y_x(u) = y \,\&\, X(u) = x'\}} P(u)$.

We can use probabilistic causal models to obtain a definition for the causal effect of one variable on another variable:

Definition 8 (causal effect) Given two disjoint sets, $X \subseteq V$ and $Y \subseteq V$, the causal effect of $X$ on $Y$ is

$$P(y \mid \hat{x}) = P(Y_x = y) \tag{2.5}$$

$$= \sum_{\{u \mid Y_x(u) = y\}} P(u) \tag{2.6}$$

for all values $x$ of $X$.

In the statistical literature, causal effect is usually defined as the difference in expected values in $Y$, assuming $X = x$ or $X = x'$. Our definition is subtly different, since we consider the value for $Y$ over the entire range of $X$, not just the difference between two values, $x$ and $x'$.


2.6  Examples 

Next  we  demonstrate  the  generality  of  the  mathematical  object  defining  causal 
models  using  two  familiar  applications:  evidential  reasoning  and  linear  structural 
equation  models. 


2.6.1  Sprinkler  Example 


^3 The existence of such a joint distribution has prompted some of the objections to treating counterfactuals as random variables, because, when $x$ and $x'$ are incompatible, it is hard to attribute probability to the joint statement “$Y$ would be $y$ if $X$ were $x$ and $X$ is actually $x'$.” The definition of $Y_x$ in terms of submodels not only avoids such problems but also illustrates that such joint probabilities can be encoded rather parsimoniously using $P(u)$ and $F$.




Figure  2.3:  Causal  graph  illustrating  causal  relationships  among  five  variables 

Figure 2.3 is a simple yet typical causal graph used in commonsense reasoning. It describes the causal relationships among the season of the year ($X_1$), whether rain falls ($X_2$) during the season, whether the sprinkler is on ($X_3$), whether the pavement is wet ($X_4$), and whether the pavement is slippery ($X_5$). All variables in this graph except the root variable $X_1$ take a value of either “True” or “False.” $X_1$ takes one of four values: “Spring,” “Summer,” “Fall,” or “Winter.” Here, the absence of a direct link between, for example, $X_1$ and $X_5$ captures our understanding that the influence of the season on the slipperiness of the pavement is mediated by other conditions (e.g., the wetness of the pavement). The corresponding model consists of five functions, each representing an autonomous mechanism:


$$\begin{aligned}
x_1 &= u_1 \\
x_2 &= f_2(x_1, u_2) \\
x_3 &= f_3(x_1, u_3) \\
x_4 &= f_4(x_3, x_2, u_4) \\
x_5 &= f_5(x_4, u_5)
\end{aligned} \tag{2.7}$$

The disturbances $U_1, \ldots, U_5$ are not shown explicitly in Figure 2.3 but are understood to govern the uncertainties associated with the causal relationships. The causal graph coincides with the Bayesian network associated with $P(x_1, \ldots, x_5)$ whenever the disturbances are assumed to be independent, $U_i \perp\!\!\!\perp U \setminus U_i$.

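A typical specification of the functions $\{f_1, \ldots, f_5\}$ and the disturbance terms is given by the Boolean model

$$\begin{aligned}
x_2 &= [(X_1 = \text{Winter}) \lor (X_1 = \text{Fall}) \lor ab_2] \land \neg ab'_2 \\
x_3 &= [(X_1 = \text{Summer}) \lor (X_1 = \text{Spring}) \lor ab_3] \land \neg ab'_3 \\
x_4 &= (x_2 \lor x_3 \lor ab_4) \land \neg ab'_4 \\
x_5 &= (x_4 \lor ab_5) \land \neg ab'_5
\end{aligned} \tag{2.8}$$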

where $x_i$ stands for $X_i = \text{true}$, and $ab_i$ and $ab'_i$ stand, respectively, for triggering and inhibiting abnormalities. For example, $ab_4$ stands for (unspecified) events that might cause the pavement to get wet ($x_4$) when the sprinkler is off ($\neg x_3$) and it does not rain ($\neg x_2$) (e.g., pouring a pail of water on the pavement), while $ab'_4$ stands for (unspecified) events that will keep the pavement dry ($\neg x_4$) in spite of rain falling ($x_2$), the sprinkler being on ($x_3$), and $ab_4$ (e.g., covering the pavement with a plastic sheet).

To represent the action “turning the sprinkler ON,” or $do(X_3 = \text{ON})$, we replace the equation $x_3 = f_3(x_1, u_3)$ in the model of Eq. (2.7) with $x_3 = \text{ON}$. The resulting submodel, $M_{X_3 = \text{ON}}$, contains all the information needed for computing the effect of the action on the other variables. It is easy to see from this submodel that the only variables affected by the action are $X_4$ and $X_5$, that is, the




descendants of the manipulated variable $X_3$. Note, however, that the operation $do(X_3 = \text{ON})$ stands in marked contrast to that of finding the sprinkler ON; the latter involves making the substitution $X_3 = \text{ON}$ without removing the equation for $X_3$, and therefore may potentially influence (the belief in) every variable in the network. This mirrors the difference between seeing and doing: after observing that the sprinkler is ON, we may wish to infer that the season is dry, that it probably did not rain, and so on; no such inferences can be drawn about the reasons for the action “turning the sprinkler ON.”
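The contrast between seeing and doing can be sketched numerically. The code below is an illustration under simplifying assumptions that are not in the report: the season is uniform over its four values, the only abnormality retained from Eq. (2.8) is the rain trigger $ab_2$ (with an assumed probability of 0.2), and all other abnormalities are switched off. The function names are hypothetical.

```python
from itertools import product

SEASONS = ["Spring", "Summer", "Fall", "Winter"]
P_AB2 = {True: 0.2, False: 0.8}   # assumed probability of an unseasonal rain trigger

def solve(season, ab2, do_sprinkler=None):
    """Solve the simplified Boolean model; do_sprinkler replaces the equation for x3."""
    rain = season in ("Winter", "Fall") or ab2        # x2
    sprinkler = season in ("Summer", "Spring")        # x3
    if do_sprinkler is not None:
        sprinkler = do_sprinkler                      # surgery on the sprinkler mechanism
    wet = rain or sprinkler                           # x4
    slippery = wet                                    # x5
    return {"rain": rain, "sprinkler": sprinkler, "wet": wet, "slippery": slippery}

def prob(event, given=lambda v: True, do_sprinkler=None):
    """P(event | given), computed in the (sub)model by enumerating the disturbances."""
    num = den = 0.0
    for season, ab2 in product(SEASONS, P_AB2):
        p = 0.25 * P_AB2[ab2]
        v = solve(season, ab2, do_sprinkler)
        if given(v):
            den += p
            if event(v):
                num += p
    return num / den

# Seeing: observing the sprinkler ON is evidence for a dry season, so rain is unlikely.
p_rain_seeing = prob(lambda v: v["rain"], given=lambda v: v["sprinkler"])
# Doing: forcing the sprinkler ON says nothing about the season.
p_rain_doing = prob(lambda v: v["rain"], do_sprinkler=True)
```

Under these assumptions the observation lowers the probability of rain to 0.2, whereas the intervention leaves it at its prior value of 0.6, since rain is not a descendant of the manipulated variable.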

2.6.2  Policy  Analysis  in  Linear  Econometric  Models 

Causal models are often used to predict the effect of policies on systems in dynamic equilibrium. In the economic literature, for example, we find the system of equations

$$q = b_1 p + d_1 i + u_1 \tag{2.9}$$

$$p = b_2 q + d_2 w + u_2 \tag{2.10}$$

where $q$ is the quantity of household demand for a product $A$, $p$ is the unit price of product $A$, $i$ is household income, $w$ is the wage rate for producing product $A$, and $u_1$ and $u_2$ represent error terms, namely, unmodeled factors that affect quantity and price, respectively [Gol92].

This system of equations constitutes a causal model (Definition 2) if we define $V = \{Q, P\}$, $U = \{U_1, U_2, I, W\}$ and assume that each equation represents an autonomous process in the sense of Definition 4. The causal graph of this model is shown in Figure 2.4. It is normally assumed that $I$ and $W$ are known, while $U_1$ and $U_2$ are unobservable and independent of $I$ and $W$. Since the error terms




Figure  2.4:  Causal  graph  illustrating  the  relationship  between  supply  and  demand 

$U_1$ and $U_2$ are unobserved, the model must be augmented with the distribution of these errors, which is usually taken to be a Gaussian distribution with the covariance matrix $\Sigma_{ij} = \mathrm{cov}(u_i, u_j)$.

We  can  use  this  model  to  answer  queries  such  as: 

1. Find the expected demand ($Q$) if the price is controlled at $P = p_0$.

2. Find the expected demand ($Q$) if the price is reported to be $P = p_0$.

3. Given that the current price is $P = p_0$, find the expected demand ($Q$) had the price been $P = p_1$.

To find the answer to the first query, we replace Eq. (2.10) with $p = p_0$, leaving

$$q = b_1 p + d_1 i + u_1 \tag{2.11}$$

$$p = p_0 \tag{2.12}$$

The demand is then $q = p_0 b_1 + d_1 i + u_1$, and the expected value of $Q$ can be obtained from $i$ and the expectation of $U_1$, giving $E[Q \mid p_0] = E[Q] + b_1 (p_0 - E[P]) + d_1 (i - E[I])$.

The answer to the second query is given by conditioning Eq. (2.9) on the current observation $\{P = p_0, I = i, W = w\}$ and taking the expectation,

$$E[Q \mid p_0, i, w] = b_1 p_0 + d_1 i + E[U_1 \mid p_0, i, w]. \tag{2.13}$$

The computation of $E[U_1 \mid p_0, i, w]$ is a standard procedure once $\Sigma_{ij}$ is given [Med69]. Note that, although $U_1$ was assumed independent of $I$ and $W$, this independence no longer holds once $P = p_0$ is observed. Note also that Eqs. (2.9) and (2.10) both participate in the solution and that the observed value $p_0$ will affect the expected demand $Q$ (through $E[U_1 \mid p_0, i, w]$) even when $b_1 = 0$, which is not the case in query 1.

The third query requires the conditional expectation of the counterfactual quantity $Q_{p = p_1}$, given the current observations $\{P = p_0, I = i, W = w\}$, namely,

$$E[Q_{p = p_1} \mid p_0, i, w] = b_1 p_1 + d_1 i + E[U_1 \mid p_0, i, w] \tag{2.14}$$

The expected value $E[U_1 \mid p_0, i, w]$ is the same in the solutions to the second and third queries; the latter differs only in the term $b_1 p_1$. A general method for solving such counterfactual queries is described in [BP95].
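The three queries can be compared side by side in a short sketch. It assumes, beyond what the text states, that $U_1$ and $U_2$ are independent zero-mean Gaussians with variances `var1` and `var2` (a diagonal $\Sigma_{ij}$); the coefficient values and function names are purely illustrative.

```python
def expected_u1_given(p0, i, w, b1, b2, d1, d2, var1, var2):
    """E[U1 | P = p0, I = i, W = w] for independent zero-mean Gaussian U1, U2."""
    # Substituting Eq. (2.9) into Eq. (2.10) gives
    #   (1 - b1*b2) * p = b2*d1*i + d2*w + b2*u1 + u2,
    # so the observation pins down the combination c = b2*u1 + u2.
    c = (1 - b1 * b2) * p0 - b2 * d1 * i - d2 * w
    # Gaussian conditioning: E[U1 | b2*U1 + U2 = c] = cov(U1, b2*U1 + U2) / var(b2*U1 + U2) * c
    return b2 * var1 / (b2 ** 2 * var1 + var2) * c

def query_control(p0, i, b1, d1):
    """Query 1: price controlled at p0; E[U1] = 0 by assumption."""
    return b1 * p0 + d1 * i

def query_observe(p0, i, w, b1, b2, d1, d2, var1, var2):
    """Query 2, Eq. (2.13): price reported to be p0."""
    return b1 * p0 + d1 * i + expected_u1_given(p0, i, w, b1, b2, d1, d2, var1, var2)

def query_counterfactual(p1, p0, i, w, b1, b2, d1, d2, var1, var2):
    """Query 3, Eq. (2.14): demand had the price been p1, given that it is p0."""
    return b1 * p1 + d1 * i + expected_u1_given(p0, i, w, b1, b2, d1, d2, var1, var2)

params = dict(b1=0.5, b2=0.5, d1=1.0, d2=1.0, var1=1.0, var2=1.0)
q1 = query_control(4.0, 2.0, params["b1"], params["d1"])
q2 = query_observe(4.0, 2.0, 1.0, **params)
q3 = query_counterfactual(6.0, 4.0, 2.0, 1.0, **params)
```

With these illustrative numbers the observed price carries information about $U_1$, so the second query differs from the first, while the third reuses the same posterior expectation of $U_1$ and changes only the term $b_1 p_1$.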

2.6.3  Linguistic  Notions  of  Causality 

Structural models provide a precise language for defining intuitive causal concepts. In this section, we provide some brief examples, all relating to a given structural model $M$.




• “$X$ is a cause of $Y$” if there exist two values $x$ and $x'$ of $X$ and a value $u$ of $U$ such that $Y_x(u) \neq Y_{x'}(u)$.

• “$X$ is a cause of $Y$ in context $Z = z$” if there exist two values $x$ and $x'$ of $X$ and a value $u$ of $U$ such that $Y_{xz}(u) \neq Y_{x'z}(u)$.

• “$X$ is a direct cause of $Y$” if there exist two values $x$ and $x'$ of $X$ and a value $u$ of $U$ such that $Y_{xr}(u) \neq Y_{x'r}(u)$, where $r$ is some realization of $V \setminus X$.

The direct causes of a variable in a model are determined in part by the granularity of the model in question. For example, consider the causal model in Figure 2.3. The sprinkler is not a direct cause of slippery pavement. However, we could change the granularity of the model by removing the variable “wet pavement” and having the rain and sprinkler affect “slippery pavement” directly. Then the sprinkler would be a direct cause of slippery pavement. Likewise, in Figure 2.3, the sprinkler is a direct cause of wet pavement. If we added a variable for “airborne water” such that it was true if either the sprinkler was on or it was raining, and such that the pavement would be wet if “airborne water” was true, then the sprinkler would no longer be a direct cause of wet pavement. All of these causal models describe the same phenomena, but the set of direct causes is different, depending upon the granularity of the model.

• “$X$ is an indirect cause of $Y$” if $X$ is a cause of $Y$, and $X$ is not a direct cause of $Y$. As in the previous example, the granularity of the model $M$ will determine which causes are direct and indirect.

• “$X$ is causally irrelevant to $Y$, given fixed $Z$” if $\forall u, z, x, x'$: $Y_{xz}(u) = Y_{x'z}(u)$ in every submodel of $M_z$. Causal irrelevance is an important concept that will be thoroughly explored in Chapter 4.




• “Event $X = x$ may have caused $Y = y$” if

(i) $X = x$ and $Y = y$ are true, and

(ii) There exists a value $u$ of $U$ such that $X(u) = x$, $Y(u) = y$, $Y_x(u) = y$ and $Y_{x'}(u) \neq y$ for some $x' \neq x$.

• “The unobserved event $X = x$ is a likely cause of $Y = y$” if

(i) $Y = y$ is true, and

(ii) $P(Y_x = y, Y_{x'} \neq y \mid Y = y)$ is high for some $x' \neq x$.

• “Event $Y = y$ occurred despite $X = x$” if

(i) $X = x$ and $Y = y$ are true, and

(ii) $P(Y_x = y)$ is low.

The  preceding  list  demonstrates  that,  by  varying  the  quantifiers  of  U  and  X, 
we  have  the  flexibility  to  find  appropriate  formalizations  for  many  nuances  of 
causal  expressions. 
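The first and third definitions in the list above can be checked mechanically on a small model. The sketch below uses a hypothetical four-variable chain loosely patterned on Figure 2.3 (sprinkler S, rain R, wet pavement W, slippery pavement L, all binary, with deterministic mechanisms). One assumption is made explicit: in the test for a direct cause, the realization $r$ fixes every endogenous variable other than $X$ and $Y$, since fixing $Y$ itself would make the test vacuous.

```python
from itertools import product

VARS = ["S", "R", "W", "L"]                    # sprinkler, rain, wet, slippery
U_VALUES = list(product([0, 1], repeat=2))     # u = (u_s, u_r)

def solve(u, do):
    """Solve the toy model under the interventions in `do`."""
    v = {}
    v["S"] = do.get("S", u[0])
    v["R"] = do.get("R", u[1])
    v["W"] = do.get("W", v["S"] | v["R"])
    v["L"] = do.get("L", v["W"])
    return v

def is_cause(x, y):
    """True if some u gives Y_x(u) != Y_x'(u) for the two values of x."""
    return any(solve(u, {x: 0})[y] != solve(u, {x: 1})[y] for u in U_VALUES)

def is_direct_cause(x, y):
    """Same test, but with all variables other than x and y held fixed at some r."""
    others = [v for v in VARS if v not in (x, y)]
    for u in U_VALUES:
        for r in product([0, 1], repeat=len(others)):
            fixed = dict(zip(others, r))
            if solve(u, {**fixed, x: 0})[y] != solve(u, {**fixed, x: 1})[y]:
                return True
    return False

def is_indirect_cause(x, y):
    return is_cause(x, y) and not is_direct_cause(x, y)
```

In this toy model `is_indirect_cause("S", "L")` holds while `is_direct_cause("S", "W")` holds, which matches the granularity discussion above: the sprinkler affects slipperiness only through wetness.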

2.7  Properties  of  Counterfactual  Statements 

We  now  provide  some  definitions  and  some  properties  that  are  true  for  all  causal 
models. 

2.7.1  Definitions  of  Causal  Properties 

Property 1 (composition) For any two singleton variables $Y$ and $W$ and any set of variables $X$ in a causal model,

$$W_x(u) = w \implies Y_{xw}(u) = Y_x(u) \tag{2.15}$$




Composition states that, in any context $X = x$, if we force a variable to a value that it would have had without our intervention, then the intervention will have no effect on other variables in the system.

Since composition allows for the removal of a subscript (i.e., reducing $Y_{xw}(u)$ to $Y_x(u)$), we need an interpretation for a variable with an empty set of subscripts which, naturally, we identify with the variable under no interventions.

Definition 9 (null action) $V_\emptyset(u) = V(u)$.

Corollary 1 (consistency) For any variables $Y$ and $W$ in a causal model,

$$W(u) = w \implies Y(u) = Y_w(u) \tag{2.16}$$

Corollary 1 follows directly from composition and null action. The implication in Eq. (2.16) was called consistency by [Rob87].^4

Property 2 (effectiveness) For all variables $Y$ and sets of variables $X$ in a causal model,

$$Y_{xy}(u) = y \tag{2.17}$$

Effectiveness specifies the effect of an intervention on the manipulated variable itself, namely, that if we force a variable $Y$ to have the value $y$, then regardless of other enforcements $X = x$, $Y$ will indeed take on the value $y$.

^4 This property and composition are tacitly used in economics [Man90] and statistics within the so-called Rubin’s model [Rub74]. To the best of our knowledge, Robins was the first to state consistency formally and to use it to derive other properties of counterfactuals [Rob87]. Composition was brought to our attention by Jamie Robins (personal communication, February 1995). A weak version of composition is mentioned explicitly in [Hol86, p. 968].




Property 3 (reversibility) For any two variables $Y$ and $W$ and any set of variables $X$ in a causal model,

$$(Y_{xw}(u) = y) \,\&\, (W_{xy}(u) = w) \implies Y_x(u) = y \tag{2.18}$$

Reversibility reflects memoryless behavior: the state of the system, $V$, tracks the state of $U$, regardless of $U$’s history. Given a context $X = x$ as in Eq. (2.18), if forcing $W$ to a value $w$ results in a value $y$ for $Y$ and, in turn, forcing $Y$ to $y$ indeed results in $W = w$, then $W$ and $Y$ will have the values $w$ and $y$, respectively, without any intervention. This follows from the requirement that the equations in every context $X = x$ have a unique solution. Thus, if we assume a solution $W = w$ and obtain $Y = y$ and, in turn, assuming a solution $Y = y$ yields $W = w$, then $(W = w, Y = y)$ is indeed the solution to the equations.

A typical example of irreversibility is a system of two agents who adhere to a tit-for-tat strategy (e.g., the prisoners’ dilemma). Such a system has two stable solutions, cooperation and defection, under the same external conditions $U$, and thus it does not satisfy the reversibility condition; forcing either one of the agents to cooperate results in the other agent’s cooperation ($Y_w(u) = y$, $W_y(u) = w$), yet this does not guarantee cooperation from the start ($Y(u) = y$, $W(u) = w$). Irreversibility, in such systems, is a product of using a state description that is too coarse, one in which not all of the factors that determine the ultimate state of the system are included in $U$. In a tit-for-tat strategy, the state description should include factors such as the previous actions of the players, and reversibility is restored once the missing factors are included.

In recursive systems, reversibility follows directly from composition. This can easily be seen by noting that in a recursive system, either $Y_{xw}(u) = Y_x(u)$ or $W_{xy}(u) = W_x(u)$. Thus, reversibility reduces to $(Y_{xw}(u) = y) \,\&\, (W_x(u) = w) \implies Y_x(u) = y$, which is another form of composition, or to $(Y_x(u) = y) \,\&\, (W_{xy}(u) = w) \implies Y_x(u) = y$, which is trivially true. In nonrecursive systems, reversibility is a property of causal loops. If forcing $W$ to a value $w$ results in a value $y$ for $Y$, and forcing $Y$ to the value $y$ results in $W$ achieving the value $w$, then $W$ and $Y$ will have the values $w$ and $y$, respectively, without any intervention.

2.7.2  Soundness  of  Composition,  Effectiveness,  and  Reversibility 

Following  standard  logic,  we  will  consider  a  property  of  causal  relationships  to 
be  sound  if  that  property  holds  in  all  structural  models. 

Theorem  1  Composition  is  sound. 

Proof: 

Since $Y_x(u)$ has a unique solution, forming $M_x$ and substituting out all other variables will yield a unique solution for $Y$, regardless of the order of substitution. So we will form $M_x$ and examine the structural equation for $Y$ in $M_x$: $Y_x = f_Y(x, z, w, u)$, where $Z$ stands for the rest of the parent set of $Y$. To solve for $Z$, we substitute out all variables except $X$, $Y$, and $W$. In other words, we substitute out all variables in $M_x$ without substituting into $X$, $W$, and $Y$, and express $Z$ as a function of $x$, $w$, and $u$. We then plug this solution into $f_Y$ to get $Y_x = f_Y(x, w, Z(x, w, u), u)$, which we can write as $Y_x = f(x, w, u)$. At this point, we can solve for $W$ by substituting out all variables in $M_x$ other than $X$, which leaves $Y_x = f(x, W(u, x), u)$. We can now see that if $w = W_x(u)$, then $Y_x(u) = Y_{xw}(u)$. □

This proof is still valid in cases where $X = \emptyset$.
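Theorem 1 can be spot-checked by brute force on a small recursive model. The sketch below is only an empirical check on one hypothetical three-variable model (the mechanisms and names are assumptions, not taken from the report); it enumerates every $u$, every intervention set, and every variable $W$ outside that set, and confirms that forcing $W$ to the value it already takes leaves the whole solution unchanged.

```python
from itertools import product

VARS = ("A", "B", "C")

def solve(u, do):
    """Recursive toy model: A = u0, B = A xor u1, C = A and B."""
    v = {}
    v["A"] = do.get("A", u[0])
    v["B"] = do.get("B", v["A"] ^ u[1])
    v["C"] = do.get("C", v["A"] & v["B"])
    return v

def interventions():
    """Every partial assignment of VARS, including the empty (null) action."""
    for choice in product((None, 0, 1), repeat=len(VARS)):
        yield {v: c for v, c in zip(VARS, choice) if c is not None}

def composition_holds():
    """Check W_x(u) = w  =>  Y_xw(u) = Y_x(u) for all Y, W, x and u."""
    for u in product((0, 1), repeat=2):
        for do in interventions():
            base = solve(u, do)
            for w_var in VARS:
                if w_var in do:
                    continue
                forced = solve(u, {**do, w_var: base[w_var]})
                if forced != base:
                    return False
    return True
```

A passing check on one model is of course not a proof; it merely illustrates what the property asserts.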

Theorem  2  Effectiveness  is  sound. 




Proof: 


This theorem follows from Definition 2, where $Y_x(u)$ is interpreted as the unique solution for $Y$ of a set of equations under $X = x$. □

Theorem  3  Reversibility  is  sound. 

Proof: 

Reversibility follows from the assumption that the solution for $V$ in every submodel is unique. Since $Y_x(u)$ has a unique solution, forming $M_x$ and substituting out all other variables will yield a unique solution for $Y$, regardless of the order of substitution. So we will form $M_x$ and examine the structural equation for $Y$ in $M_x$, which in general might be a function of $X$, $W$, $U$, and additional variables: $Y_x = f_Y(x, w, z, u)$, where $Z$ stands for parents of $Y$ not contained in $X \cup W \cup U$. We now solve for $Z$ by substituting out all variables except $X$, $Y$, and $W$. That is, we substitute out all variables in $M_x$, without substituting into $X$, $W$, and $Y$, and express $Z$ as a function of $x$, $w$, and $u$. We then plug this solution into $f_Y$ to get $Y_x = f_Y(x, w, Z(x, w, u), u)$, which we can write as $Y_x = f(x, w, u)$. We now consider what would happen if we solved for $Y$ in $M_{xw}$. Since we avoided substituting anything into $W$ when we solved for $Y$ in $M_x$, we will get the same result as before, namely, $Y_{xw} = f(x, w, u)$. In the same way, we can show that $W_x = g(x, y, u)$ and $W_{xy} = g(x, y, u)$. So, solving for $y = Y_x(u)$, $w = W_x(u)$ is the same as solving for $y = f(x, w, u)$ and $w = g(x, y, u)$, which is the same as solving for $y = Y_{xw}(u)$, $w = W_{xy}(u)$. Thus, any solution $y$ to $y = Y_{xw}(u)$, $w = W_{xy}(u)$ is also a solution to $y = Y_x(u)$. □




2.7.3  Independence  of  Effectiveness,  Composition,  and  Reversibility 

We  now  show  that  effectiveness,  composition,  and  reversibility  are  independent. 
Theorem  4  Effectiveness,  composition,  and  reversibility  are  independent. 

Proof: 

In nonrecursive systems, the three properties of composition, effectiveness, and reversibility are independent; none is a consequence of the other two. This can be shown by constructing a truth table for counterfactual statements, such that any two properties hold and the third does not. Consider a model of two binary variables $X$ and $Y$, and a single value $u$ of $U$. Table 2.1 is a truth table for counterfactual statements over $X$ and $Y$ such that composition and effectiveness hold but reversibility does not. Table 2.2 is a similar table where effectiveness and reversibility hold but composition does not. Finally, Table 2.3 is a truth table where composition and reversibility hold but effectiveness does not.
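The style of argument used in this proof can be sketched as a small checker that tests each property against a table of counterfactual statements. The table constructed below is a hypothetical example written for this sketch and is not claimed to reproduce the entries of Table 2.1; it corresponds to a two-variable loop in which each variable copies the other. As a further assumption, composition and reversibility are tested only for distinct variables that lie outside the intervention set.

```python
from itertools import product

VARS = ("X", "Y")
VALUES = (0, 1)

def key(do):
    """Canonical hashable form of an intervention set."""
    return tuple(sorted(do.items()))

def all_interventions():
    """Every partial assignment of VARS, including the empty one."""
    for choice in product((None,) + VALUES, repeat=len(VARS)):
        yield {v: c for v, c in zip(VARS, choice) if c is not None}

def effectiveness(table):
    return all(table[key(do)][v] == val
               for do in all_interventions() for v, val in do.items())

def composition(table):
    for do in all_interventions():
        free = [v for v in VARS if v not in do]
        for w_var in free:
            extended = {**do, w_var: table[key(do)][w_var]}
            for y_var in free:
                if y_var != w_var and table[key(extended)][y_var] != table[key(do)][y_var]:
                    return False
    return True

def reversibility(table):
    for do in all_interventions():
        free = [v for v in VARS if v not in do]
        for y_var, w_var in product(free, repeat=2):
            if y_var == w_var:
                continue
            for y, w in product(VALUES, repeat=2):
                y_given_w = table[key({**do, w_var: w})][y_var]
                w_given_y = table[key({**do, y_var: y})][w_var]
                if y_given_w == y and w_given_y == w and table[key(do)][y_var] != y:
                    return False
    return True

# Hypothetical table: with no intervention X = Y = 0, yet forcing either variable
# to a value drags the other one along to the same value.
table = {key({}): {"X": 0, "Y": 0}}
for a in VALUES:
    table[key({"X": a})] = {"X": a, "Y": a}
    table[key({"Y": a})] = {"X": a, "Y": a}
    for b in VALUES:
        table[key({"X": a, "Y": b})] = {"X": a, "Y": b}
```

For this table `effectiveness` and `composition` return `True` while `reversibility` returns `False`, because $Y_{x=1} = 1$ and $X_{y=1} = 1$ hold although $Y = 0$ without intervention.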

2.7.4  Completeness  of  Causal  Properties 

Examining these properties of causal models raises two obvious questions. One
is  “Have  we  missed  any?”  That  is,  are  there  any  restrictions  on  legal  causal 
sentences  that  are  not  captured  by  the  above  properties?  The  other  is  “How  are 
these  properties  different  from  those  derived  in  other  systems?”  That  is,  does 
our  formalism  impose  more  restrictions  than  other  systems  on  the  number  of 
valid  causal  statements?  How  does  the  number  of  valid  causal  statements  in  our 
system  compare,  for  instance,  to  the  number  in  the  very  general  one  of  Lewis? 
This  section  will  answer  both  of  these  questions,  at  least  for  the  case  of  recursive 
models. 




Table 2.1: Table of counterfactual statements where composition and effectiveness hold but reversibility does not [table entries not recoverable]


Table 2.2: Table of counterfactual statements where effectiveness and reversibility hold but composition does not [table entries not recoverable]




Table 2.3: Table of counterfactual statements where composition and reversibility hold but effectiveness does not [table entries not recoverable]


Lewis  was  very  careful  to  keep  his  formalism  as  general  as  possible,  and, 
save  for  the  obvious  requirement  that  every  world  be  closest  to  itself,  he  did  not 
impose  any  specific  structure  on  the  distance  measure.  However,  the  fact  that 
people  manage  to  communicate  with  counterfactuals  suggests  that  the  distance 
measure  is  shared  by  many  people  and,  hence,  that  it  is  not  entirely  arbitrary 
but  must  be  one  which  can  be  encoded  parsimoniously  in  the  mind.  What  then 
is  the  mental  representation  used  in  the  encoding  of  interworld  distances? 

Lewis  himself  provides  a  clue:  the  closest  worlds  that  he  envisions  are  causal 
in  nature.  For  instance,  when  Lewis  considers  as  an  example  a  hypothetical  world 
in  which  kangaroos  have  no  tails,  he  argues  that  not  just  the  state  of  the  tail,  but 
also  the  tracks  that  the  animal  makes,  the  animal’s  balance,  and  a  variety  of  other 
factors  would  also  be  different.  Thus,  Lewis  appeals  to  our  common  knowledge 
of  cause  and  effect  in  laying  out  which  factors  are  expected  to  be  different  in  the 




hypothetical  world  and  which  factors  are  expected  to  be  unaltered. 

If our assessment of interworld distances comes from causal knowledge, the question arises of whether that knowledge imposes its own structure on distances, a structure that is not captured in Lewis’s logic. Phrased differently, by agreeing to measure closest worlds on the basis of causal relations, do we restrict the set of counterfactual statements we regard as valid? The question is not merely theoretical. For example, Gibbard and Harper [1981] characterize sentences of the form “If we do A, then S” using Lewis’s general framework, while Pearl [Pea95a] developed a calculus of action based directly on causal models. Are the two formalisms identical?

For recursive systems, the answer is yes. As we prove next, given a causal ordering on the variables in the system, composition is complete.^5 Thus, for recursive systems, we know that we have not missed any causal properties, and that our formalism imposes no more restrictions than Lewis’s on the validity of causal or counterfactual statements.

Theorem 5 Composition, together with effectiveness, definiteness, and uniqueness, are complete for causally ordered systems.

We first give some notational definitions that we will use in our proof. A formal proof of completeness requires two additional properties, definiteness and uniqueness.^6

^5 The formal completeness proof requires effectiveness and two other technical definitions. Composition (and, for nonrecursive systems, reversibility) encapsulate the essence of structural models.

^6 These two properties, definiteness and uniqueness, were kept implicit in the completeness proof originally reported in [GP97]; the need to explicate them formally was brought to my attention by [Hal97].




Property 4 (definiteness) For any variable $X$ and set of variables $Y$,

$$\exists x \in X \ \text{s.t.}\ X_y(u) = x \tag{2.19}$$

Property 5 (uniqueness) For every variable $X$ and set of variables $Y$,

$$X_y(u) = x \,\&\, X_y(u) = x' \implies x = x' \tag{2.20}$$

Definition 10 (statement) By a counterfactual statement, or statement for short, we denote a sentence of the form $Y_x(u) = y$ for a specific variable $Y \in V$, a specific realization $x$ of $X \subseteq V$, and a specific $u$ in the domain of $U$.

Definition 11 (causal ordering) A causal ordering $X_1, \ldots, X_n$ of a set of variables is an ordering such that for any two variables $X = X_i$ and $Y = X_k$ such that $i < k$, $X_{yz}(u) = X_z(u)$, where $Z$ is any set of variables not including $X$ or $Y$.

Clearly,  for  every  recursive  model  we  can  find  an  ordering  that  satisfies  the 
condition  of  Definition  11.  In  fact,  every  ordering  consistent  with  the  arrows 
of the causal graph $G(M)$ will satisfy this condition. A system in which the
variables  are  indexed  along  a  specific  causal  ordering  will  be  called  a  causally 
ordered  system. 

Definition 12 (semantic entailment) Given a set $S$ of counterfactual statements, let $M_S$ be the set of models of $S$, namely, the set $\{m_1, \ldots, m_n\}$ of all causal models such that all statements in $S$ hold for each $m_i$. A counterfactual statement $\sigma$ is semantically entailed by $S$, written $S \models \sigma$, if $\sigma$ holds in each $m_i \in M_S$.

Definition 13 (syntactic entailment) Given a set $A$ of axioms, a set of counterfactual statements $S$ syntactically entails a counterfactual statement $\sigma$, written $S \vdash_A \sigma$, if $\sigma$ can be derived from $S$ using repeated applications of axioms from $A$ together with the rules of logic.

Define $A_C$ to be the set {composition, effectiveness, definiteness, uniqueness, causal ordering}. We want to show that all statements that are semantically entailed by $S$ are also syntactically entailed by $S$, namely, that

$$S \models \sigma \implies S \vdash_{A_C} \sigma$$

It is enough to show that every set of statements $S$ that is consistent with $A_C$ has a model. To see that this is sufficient to prove the completeness of $A_C$, assume that there is some set $S$ and statement $p: X_z(u) = x$ such that in every model consistent with $S$, $p$ holds, and $p$ is not derivable from $S$ using $A_C$. Since $p$ is not derivable from $S$, there must be some other statement $p': X_z(u) = x'$, $x \neq x'$, such that $S \cup \{p'\}$ is consistent with $A_C$. Since in every model consistent with $S$, $X_z(u) = x$ holds, no model is consistent with $S \cup \{p'\}$. Thus, if $A_C$ is not complete, then there must exist some set $S'$ that is consistent with $A_C$ and has no model. Looking at the contrapositive, if every set of statements $S$ that is consistent with $A_C$ has a model, then $A_C$ is complete.

We now show that for any set of statements $S$, if $S$ is consistent under $A_C$ then $S$ has a model. We will use the concept of a maximally consistent set, which is a standard technique used to prove completeness in modal logic [FHM95]. Consider a maximally consistent set $S^*$, that is, a superset of $S$ that is consistent with $A_C$ such that any proper superset of $S^*$ is not consistent with $A_C$. We will show that there is a causal model $M$ which satisfies every statement in $S^*$, and thus satisfies every statement in $S$.^7

^7 We thank Joseph Halpern for calling my attention to this technique, which simplifies appreciably the completeness proof originally reported in [GP97].




Proof (by induction): We prove that, for any maximally consistent set $S^*$,
there exists a causal model $M$ which satisfies every statement in $S^*$, by induction
on the number of variables $|V|$ in $S^*$.

Base Case:

If $|V| = 1$, then the statements $X(u)$ in $S^*$ determine the function for $X$, and
effectiveness ensures that $X_x(u) = x$ for all values $x$ of $X$.

Inductive  Case: 

Consider the variables $V$ that are in $S^*$. Let $Y \in V$ be the last element in the
causal ordering. Consider the set $S'^*$, which is $S^*$ with all statements of the form
$Y_z(u) = y$ and $X_{yz}(u) = x$ removed. By the inductive hypothesis, there is a
model $M'$ such that every element of $S'^*$ is satisfied.

We now extend $M'$ to $M$, such that every element in $S^*$ is satisfied in
$M$. For each variable $X \in M'$ and each value $y$ of $Y$,
$f_X(x_1, \ldots, x_k, y, u) = f'_X(x_1, \ldots, x_k, u)$, where $f'_X$ is the function for $X$
in $M'$. We define $f_Y$ as follows: for each statement $(Y_z(u) = y) \in S^*$
such that $|Z| = |V| - 1$ and $Y \notin Z$, $f_Y(z, u) = y$. Definiteness ensures that $f_Y$
will be completely determined.

Since $M'$ satisfies all elements of $S'^*$, and given the causal ordering such that
$X_{yz}(u) = X_z(u)$ for all $X_{yz}(u)$, $X_z(u)$ in $S^*$, $M$ satisfies all statements of the
form $X_z(u)$ in $S^*$.

We now show that $M$ satisfies every element of $S^*$ of the form $Y_z(u) = y$. We
show this by induction on the size of $|V| - |Z|$.

Base Cases:

(i) $Y \in Z$. By effectiveness, $Y_z(u) = y$ is satisfied in $M$.

(ii) $|V| - |Z| = 1$. By construction of $f_Y$, $Y_z(u) = y$ is satisfied in $M$.




Inductive Case:

$|V| - |Z| = k$. Consider $Y_{zx}(u) = y'$, where $x = X_z(u)$. Above, we proved that
$X_z(u) = x$ is satisfied in $M$, and by the inductive hypothesis, $Y_{zx}(u) = y'$ is satisfied in
$M$. Thus, by composition, $Y_z(u) = y'$ is satisfied in $M$ and, also by composition,
$y = y'$. Thus, $Y_z(u) = y$ is satisfied in $M$. □

Joseph Halpern [Hal97] has recently shown that composition, reversibility,
effectiveness, and definiteness are complete in all causal models, recursive as
well as nonrecursive, as long as the uniqueness assumption holds. He further
characterized systems in which uniqueness does not hold, using a more
elaborate set of axioms.

2.8  Comparison  to  Lewis’s  Formalism 

We  now  compare  our  causal  model  framework  to  that  of  Lewis  [Lew73b],  to 
show  that  for  recursive  systems,  composition  and  effectiveness  are  sound  and 
complete  within  Lewis’s  framework.  We  give  here  a  version  of  Lewis’s  logic  for 
counterfactual sentences (from [Lew81]).




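Rules

(1) If $A$ and $A \Rightarrow B$ are theorems, then so is $B$.

(2) If $(B_1 \,\&\, \ldots) \Rightarrow C$ is a theorem, then so is
$((A \mathrel{\Box\!\!\rightarrow} B_1) \,\&\, \ldots) \Rightarrow (A \mathrel{\Box\!\!\rightarrow} C)$.

Axioms

(1) All truth-functional tautologies

(2) $A \mathrel{\Box\!\!\rightarrow} A$

(3) $((A \mathrel{\Box\!\!\rightarrow} B) \,\&\, (B \mathrel{\Box\!\!\rightarrow} A)) \Rightarrow ((A \mathrel{\Box\!\!\rightarrow} C) \equiv (B \mathrel{\Box\!\!\rightarrow} C))$

(4) $((A \lor B) \mathrel{\Box\!\!\rightarrow} A) \lor ((A \lor B) \mathrel{\Box\!\!\rightarrow} B) \lor (((A \lor B) \mathrel{\Box\!\!\rightarrow} C) \equiv ((A \mathrel{\Box\!\!\rightarrow} C) \,\&\, (B \mathrel{\Box\!\!\rightarrow} C)))$

(5) $(A \mathrel{\Box\!\!\rightarrow} B) \Rightarrow (A \supset B)$

(6) $(A \,\&\, B) \Rightarrow (A \mathrel{\Box\!\!\rightarrow} B)$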

The statement $A \mathrel{\Box\!\!\rightarrow} B$ stands for “In all closest worlds where $A$ holds, $B$ holds
as well.” Lewis does not put any restrictions on the definition of closest worlds,
beyond the obvious requirement that world $w$ be no further from itself than any
other world $w' \neq w$. In essence, causal models with local interventions define an
ordering among worlds that gives a metric by which to define what worlds are
closest. As such, all of Lewis’s axioms are true for causal models and follow from
effectiveness, composition, and (for nonrecursive systems) reversibility.

In  order  to  relate  Lewis’s  axioms  to  our  framework,  we  need  to  translate  his 
syntax  into  the  language  of  causal  models.  We  will  equate  Lewis’s  “world”  with 
an  instantiation  of  all  variables  in  a  causal  model,  including  the  variables  in  U. 
Propositions, such as $A$ and $B$ in the statements above, will be limited to the
assignment of values to subsets of variables in a model. Thus, the meaning of
the statement $A \mathrel{\Box\!\!\rightarrow} B$ in causal models is “If we force a set of variables to have
the values $A$, a second set of variables will have the values $B$.” Let $A$ stand for a
set of values $x_1, \ldots, x_n$ of the variables $X_1, \ldots, X_n$, and let $B$ stand for a set of
values $y_1, \ldots, y_m$ of the variables $Y_1, \ldots, Y_m$. Then,

$A \mathrel{\Box\!\!\rightarrow} B \;\equiv\; (Y_1)_{x_1 \ldots x_n}(u) = y_1 \;\&\; (Y_2)_{x_1 \ldots x_n}(u) = y_2 \;\&\; \cdots \;\&\; (Y_m)_{x_1 \ldots x_n}(u) = y_m$   (2.21)

Conversely, we need to define what statements such as $Y_x(u) = y$ mean in
Lewis’s notation. Let $A$ stand for the proposition $X = x$ and $B$ stand for the
proposition $Y = y$. Then,

$Y_x(u) = y \;\equiv\; A \mathrel{\Box\!\!\rightarrow} B$   (2.22)

We  can  now  examine  each  of  Lewis’s  axioms  in  turn. 




(1)  Trivially  true. 


(2) This axiom is the same as effectiveness. Namely, if we force a set of variables
$X$ to have the value $x$, then the resulting value of $X$ is $x$. That is, $X_x(u) = x$.

(3) This axiom is a weaker form of reversibility, which is relevant only for
nonrecursive causal models.

(4) Since actions in causal models are restricted to conjunctions of literals, this
axiom does not apply. However, under the interpretation $do(A \lor B) =
do(A) \lor do(B)$, this axiom does hold.

(5)  This  axiom  follows  directly  from  composition. 

(6)  This  axiom  follows  directly  from  composition. 
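
The correspondence above can be checked mechanically on a small example. The following is a minimal sketch, assuming a hypothetical three-variable recursive model that is not taken from the report (the equations inside `solve` and the helper names are illustrative only). It enumerates every disturbance and every single-variable intervention and tests effectiveness (axiom 2), composition, and the special case of composition that yields axiom (6).

```python
from itertools import product

VARS = ("X", "Z", "Y")

def solve(u, do=None):
    """Solve a hypothetical recursive model under the intervention `do`.

    `do` maps a variable name to the value it is forced to take; the
    structural equations below are assumptions made for illustration.
    """
    do = do or {}
    u1, u2 = u
    v = {}
    v["X"] = do.get("X", u1)
    v["Z"] = do.get("Z", v["X"] ^ u2)
    v["Y"] = do.get("Y", v["Z"] & u1)
    return v

def effectiveness_holds():
    # Axiom (2): forcing a variable to a value makes it take that value.
    return all(solve(u, {var: val})[var] == val
               for u in product((0, 1), repeat=2)
               for var in VARS for val in (0, 1))

def composition_holds():
    # If W_x(u) = w, then forcing W = w in addition to X = x changes nothing.
    for u in product((0, 1), repeat=2):
        for xvar in VARS:
            for x in (0, 1):
                base = solve(u, {xvar: x})
                for wvar in VARS:
                    if wvar != xvar and solve(u, {xvar: x, wvar: base[wvar]}) != base:
                        return False
    return True

def axiom6_holds():
    # Axiom (6): if A and B hold in the actual world, then forcing A leaves B true.
    for u in product((0, 1), repeat=2):
        actual = solve(u)
        for avar in VARS:
            if solve(u, {avar: actual[avar]}) != actual:
                return False
    return True
```

Such a check covers only the one model it enumerates; it illustrates the axioms rather than proving them.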

Likewise, composition and effectiveness follow from Lewis’s axioms. Composition
is a consequence of Lewis’s axiom (5) and rule (1), while effectiveness is
the same as Lewis’s axiom (2). Thus, causal models do not add any restrictions
to counterfactual statements above those imposed by Lewis’s framework, when
we are considering recursive models. When we consider nonrecursive systems, we
see that reversibility is not enforced by Lewis’s framework. Lewis’s axiom (3),
while similar, is not as strong as reversibility. For instance, $Y = y$ may hold in
all closest $w$-worlds, $W = w$ may hold in all closest $y$-worlds and, still, $Y = y$
may not hold in our world. A graphical example violating reversibility in Lewis’s
framework is given in Figure 2.5.




Figure 2.5: Example of the failure of reversibility in Lewis’s framework: $W = w$
holds in all closest $y$-worlds, and $Y = y$ holds in all closest $w$-worlds, yet $Y \neq y$
and $W \neq w$.

2.9  Applying  Counterfactual  Derivation:  Example 

Consider the century-old debate about the effect of smoking on the incidence of
lung cancer. According to many, the tobacco industry has managed to block
antismoking legislation by arguing that the observed correlation between smoking ($X$)
and lung cancer ($Y$) could be explained by some sort of carcinogenic genotype
($U_1$) that involves inborn craving for nicotine.* However, according to the Surgeon
General’s report of 1964, there is a causal link between smoking and lung cancer
that is mediated by the accumulation of tar deposits in a person’s lungs ($Z$). The
two claims are combined in the graphical model of Figure 2.6, which represents
causal theories having the following structure:

$V = \{X \text{ (Smoking)}, Y \text{ (Lung Cancer)}, Z \text{ (Tar in Lungs)}\}$

$U = \{U_1, U_2\}$,   $(U_1 \perp\!\!\!\perp U_2)_{P(u)}$

$x = f_1(u_1)$,   $z = f_2(x, u_2)$,   $y = f_3(z, u_1)$

The graphical model embodies several assumptions. The absence of a direct
link between $X$ and $Y$ represents the assumption that the effect of smoking
cigarettes ($X$) on the production of lung cancer ($Y$) is entirely mediated through
tar deposits in the lungs ($Z$). The lack of a direct link between $U_1$ and $U_2$ reflects
the assumption that even if a genotype is aggravating the production of lung
cancer, it nevertheless has no effect on the amount of tar in the lungs except
indirectly, through cigarette smoking.

*For an excellent historical account of this debate, see [SGS93, pp. 291-302].

The graph conveys in fact the stronger assumption that $U_1$ and $U_2$ are
marginally independent, written $(U_1 \perp\!\!\!\perp U_2)_{P(u)}$, which is represented by the
absence of a dotted arc connecting $U_1$ and $U_2$.

To demonstrate how we can assess the degree to which cigarette smoking
increases (or decreases) lung cancer risk, we imagine a study in which the three
variables, $X$, $Y$, and $Z$, were measured simultaneously on a large, randomly
selected sample from the population. From such data, we wish to assess the risk
of lung cancer (for a randomly chosen person in the population) under two
hypothetical policies: smoking ($X = 1$) and refraining from smoking ($X = 0$). In
other words, we wish to derive an expression for the probability of the causal
effect $Y_x$, $P(Y_x = y)$, based on the joint distribution $P(x, y, z)$ and the assumptions
embedded in the graphical model. These assumptions can be translated into the
language of counterfactuals, using two simple rules (see [Pea95a, p. 704]):

Rule 1 Exclusion restrictions. For every variable $Y$ having parents $PA_Y$, and for
every set of variables $Z$ disjoint of $PA_Y$, we have

$Y_{pa_Y}(u) = Y_{pa_Y, z}(u)$   (2.23)

Rule 2 Independence restrictions. If $Z_1, \ldots, Z_k$ is any set of nodes in $V$ not
connected to $Y$ via some $U$ variable, we have

$Y_{pa_Y} \perp\!\!\!\perp \{(Z_1)_{pa_{Z_1}}, \ldots, (Z_k)_{pa_{Z_k}}\}$   (2.24)

Terms not parameterized by $u$, such as those in Eq. (2.24), denote random
variables induced by $P(u)$.


[Figure: causal diagram over $X$ (Smoking), $Z$ (Tar in Lungs), and $Y$ (Cancer), with exogenous variables $U_1$ and $U_2$]

Figure 2.6. Causal graph illustrating the effect of smoking on lung cancer

Applying  these  two  rules,  we  see  that  the  causal  model  encodes  the  following 
assumptions: 


$Z_x(u) = Z_{yx}(u)$   (2.25)

$X_y(u) = X_{zy}(u) = X_z(u) = X(u)$   (2.26)

$Y_z(u) = Y_{zx}(u)$   (2.27)

$Z_x \perp\!\!\!\perp \{Y_z, X\}$   (2.28)

Eqs. (2.25)-(2.27) were obtained using the exclusion restrictions of Eq. (2.23).
Eq. (2.25), for instance, represents the absence of a causal link from $Y$ to $Z$,
while Eq. (2.26) represents the absence of a causal link from $Z$ or $Y$ to $X$. In
contrast, Eq. (2.28) follows from the independence restriction of Eq. (2.24) and
represents the lack of a connection between (i.e., the independence of) $U_1$ and
$U_2$.

We now use these assumptions, and the properties of composition and
effectiveness, to carry out three tasks:

Task 1 Compute $P(Z_x = z)$, the probabilistic causal effect of $X$ on $Z$.

$P(Z_x = z) = P(Z_x = z \mid x)$   from Eq. (2.28)

$\qquad\quad\;\; = P(Z = z \mid x)$   by composition   (2.29)

$\qquad\quad\;\; = P(z \mid x)$

Task 2 Compute $P(Y_z = y)$, the probabilistic causal effect of $Z$ on $Y$.

$P(Y_z = y) = \sum_x P(Y_z = y \mid x) P(x)$   (2.30)

and since $Y_z \perp\!\!\!\perp Z_x \mid X$,

$P(Y_z = y \mid x) = P(Y_z = y \mid x, Z_x = z)$   from Eq. (2.28)

$\qquad\quad\;\; = P(Y_z = y \mid x, z)$   by composition   (2.31)

$\qquad\quad\;\; = P(y \mid x, z)$   by composition

Substituting Eq. (2.31) in Eq. (2.30) gives

$P(Y_z = y) = \sum_x P(y \mid x, z) P(x)$   (2.32)




Task 3 Compute $P(Y_x = y)$, the probabilistic causal effect of $X$ on $Y$.

For any variable $Z$,

$Y_x(u) = Y_{xz}(u)$ if $Z_x(u) = z$   by composition

Since $Y_{xz}(u) = Y_z(u)$ (from Eq. (2.27)),

$Y_x(u) = Y_{x z_x}(u) = Y_{z_x}(u)$ where $z_x = Z_x(u)$   (2.33)

Thus,

$P(Y_x = y) = P(Y_{z_x} = y)$   from Eq. (2.33)

$\qquad\quad\;\; = \sum_z P(Y_{z_x} = y \mid Z_x = z) P(Z_x = z)$

$\qquad\quad\;\; = \sum_z P(Y_z = y \mid Z_x = z) P(Z_x = z)$   by composition

$\qquad\quad\;\; = \sum_z P(Y_z = y) P(Z_x = z)$   from Eq. (2.28)   (2.34)

$P(Y_z = y)$ and $P(Z_x = z)$ were computed in Eq. (2.32) and Eq. (2.29), respectively.
Substituting gives us

$P(Y_x = y) = \sum_z P(z \mid x) \sum_{x'} P(y \mid x', z) P(x')$   (2.35)

The right hand side of Eq. (2.35) can be computed from $P(x, y, z)$.
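
As a numerical illustration of Eq. (2.35), the sketch below builds a small structural model with the same shape as the smoking example ($x = f_1(u_1)$, $z = f_2(x, u_2)$, $y = f_3(z, u_1)$), computes the observational joint $P(x, y, z)$, and compares the right-hand side of Eq. (2.35) with the causal effect obtained by intervening directly in the model. The particular functions `f1`, `f2`, `f3` and the disturbance probabilities are assumptions chosen for illustration; they do not come from the report.

```python
from itertools import product

# Assumed disturbance distributions (U1 and U2 independent).
P_U1 = {0: 0.5, 1: 0.3, 2: 0.2}
P_U2 = {0: 0.8, 1: 0.2}

# Assumed structural functions with the shape used in the text.
def f1(u1):
    return 1 if u1 >= 1 else 0

def f2(x, u2):
    return x ^ u2

def f3(z, u1):
    return 1 if u1 == 2 or (z == 1 and u1 <= 1) else 0

def joint():
    """Observational joint P(x, y, z), obtained by summing over the disturbances."""
    p = {}
    for (u1, p1), (u2, p2) in product(P_U1.items(), P_U2.items()):
        x = f1(u1)
        z = f2(x, u2)
        y = f3(z, u1)
        p[(x, y, z)] = p.get((x, y, z), 0.0) + p1 * p2
    return p

def prob(p, **fixed):
    """Marginal probability of the assignments in `fixed` (keys among x, y, z)."""
    idx = {"x": 0, "y": 1, "z": 2}
    return sum(v for k, v in p.items()
               if all(k[idx[name]] == val for name, val in fixed.items()))

def front_door(p, x, y):
    """Right-hand side of Eq. (2.35), using only the observational joint."""
    total = 0.0
    for z in (0, 1):
        p_z_given_x = prob(p, x=x, z=z) / prob(p, x=x)
        inner = sum(prob(p, x=xp, y=y, z=z) / prob(p, x=xp, z=z) * prob(p, x=xp)
                    for xp in (0, 1))
        total += p_z_given_x * inner
    return total

def true_effect(x, y):
    """P(Y_x = y), computed by forcing X = x inside the structural model."""
    total = 0.0
    for (u1, p1), (u2, p2) in product(P_U1.items(), P_U2.items()):
        z = f2(x, u2)
        if f3(z, u1) == y:
            total += p1 * p2
    return total
```

In this toy model the two quantities agree for every pair $(x, y)$, even though $U_1$ confounds $X$ and $Y$ and is never observed. The comparison relies on every $(x, z)$ combination having positive probability, which the assumed distributions guarantee.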

In general, the reduction of counterfactual probabilities $P(Y_x = y)$ to expressions
involving probabilities over observed variables is called identifiability. Our
completeness result implies that any identifiable counterfactual quantity $P(Y_x = y)$
can be reduced to the correct expression by repeated application of composition
and effectiveness.




2.10  Conclusion 


We have given a formal, mathematical description of causal models, as well
as some justification for our model. We have derived a set of properties of
causation (composition, effectiveness, and reversibility) that follow from the
structural model formalism. For recursive (and causally ordered) models, we
have shown that composition and effectiveness are complete.

The completeness proof for composition and effectiveness in recursive causal
models has two major implications for recursive systems. First, it shows that the
structural interpretation of counterfactuals adds no restrictions beyond those of
Lewis’s closest-world interpretation. Second, it shows that the very general
language of Lewis’s closest-world framework embodies all of the causal restrictions
on counterfactuals that are not embodied already by the requirement of
recursiveness. In nonrecursive systems, however, there is a difference between the
two formalisms. The causal reading of counterfactuals imposes the additional
restriction of reversibility.

Moreover, the completeness result assures us that a proof may safely be
attempted with two axioms only, that is, all truths derivable by graph-based
analysis are also derivable using effectiveness and composition. This does not in any
way diminish the usefulness of graphs in causal analysis. Graphs play an essential
role in knowledge specification: graphical specification of premises followed by
translation to counterfactuals is more natural than trying to articulate premises
directly as counterfactuals. Graphs may also assist in the proof procedure and in
providing independence relations (among counterfactuals and visible variables)
that are not easily derived symbolically. Nevertheless, we have shown that the
power of symbolic counterfactual analysis is not lower than that of graphical
analysis whenever the system is causally ordered. In nonrecursive models, this is
not necessarily the case. Attempts to evaluate counterfactual statements using
only composition and effectiveness may fail to certify some statements that are
true in all causal models, and whose validity can only be recognized through the
use of reversibility. Whether composition, effectiveness, and reversibility are complete
for nonrecursive as well as recursive systems has remained an open question until
very recently. Halpern [Hal97] has settled the problem in the affirmative.

We have also shown how some standard causal utterances can be transformed
into the structural model framework. This gives each of these terms a precise
definition which, in turn, allows us to reason about them without ambiguity.




CHAPTER  3 


Dynamic  Causal  Models 

3.1  Introduction 

One problem with causal models is that they do not allow for the concept of
memory. That is, each variable is a function of only the current state of other
variables in the system, and, thus, only of the current value of the disturbances.
The past cannot affect the present. Many real-world examples require the concept
of previous state to be accurately modeled. For instance, a standard computer
(which has an internal memory) could not be accurately modeled by a standard
causal model. Even less complicated systems require the concept of past state.
Something as simple as a ballpoint pen whose point extends and retracts
with the push of a single button is difficult to model with a standard causal
model. Any system that requires the concept of a previous state will not be
completely definable using a causal model. We can overcome this shortcoming
by adding some simple extensions to causal models.

3.2  Causal  Models  with  Memory 

We will consider each variable $X_i$ to be a function not only of the other variables
in $V$ and $U$, but also of the previous state of a subset of variables in $V$. This subset may
include the previous value of $X_i$ as well as the previous values of the parents of $X_i$.




Our  object  is  to  define  a  smooth  transition  between  full-blown  temporal  networks 
and  static  causal  models.  Often,  temporal  networks  offer  more  power  than  we 
need.  For  instance,  in  economic  models  of  supply  and  demand,  we  are  interested 
only  in  the  stable  state  of  the  variables,  not  in  transitory  effects.  By  removing 
these  transitory  effects  from  our  model,  we  allow  for  simpler  computations.  Thus, 
we  will  improve  the  expressive  power  of  causal  models  while  maintaining  the 
computational  savings  provided  by  the  approximations  of  static  systems. 

More  formally,  we  can  define  an  extension  to  causal  models  as  follows. 

Definition 14 (causal model with memory) A causal model with memory is a
4-tuple

$M = \langle V, V^-, U, F \rangle$

where

(i) $V = \{X_1, \ldots, X_n\}$ is a set of endogenous variables determined within the
system,

(ii) $V^- = \{X_1^-, \ldots, X_n^-\}$ is a set of previous values of the endogenous variables
$V$,

(iii) $U = \{U_1, \ldots, U_m\}$ is a set of exogenous variables that represent disturbances,
abnormalities, assumptions, or boundary conditions, and

(iv) $F = \{f_{X_i}\}$ is a set of $n$ deterministic, nontrivial functions, each of the form

$x_i = f_{X_i}(pa_i, pa_i^-, u)$,   $i = 1, \ldots, n$,   $pa_i \subseteq V \setminus X_i$,   $pa_i^- \subseteq V^-$

We will assume that the set of equations in (iv) has a unique solution for
$X_1, \ldots, X_n$, given any value of the disturbances $U$ and past values $V^-$.
Thus we can consider each variable $Y$ to be a function of the disturbances $U$ and
the past values of $V$ in the causal model $M$: $Y = Y_M(u, v^-)$.

As  with  standard  causal  models,  we  can  also  define  probabilistic  causal  models 
with  memory. 

Definition 15 (probabilistic causal model with memory) A probabilistic causal
model with memory is a tuple

$\langle M, P(u) \rangle$

where

(i) $M$ is a causal model with memory $\langle V, V^-, U, F \rangle$, and

(ii) $P(u)$ is a probability distribution over $U$, such that each element $U_i \in U$ is
marginally independent of all other elements of $U$.

In  this  model,  time  is  discretized.  At  each  time  step,  the  state  of  the  world  is 
determined  by  the  previous  state  of  each  variable  together  with  the  current  state 
of  the  other  variables  in  the  system. 

Consider a ballpoint pen that extends and retracts its point with a single
button. This item would be difficult, if not impossible, to model using a standard
causal model. When the button is depressed and released, the current state of
the point relies not on its current causal influences, but on its own previous state.

Using a causal model with memory, however, the pen can easily be described, as
in Figure 3.1.

Notice that in this example, $pa_i^- = \{X_i^-\}$. We next consider what power we
lose by restricting $pa_i^-$ to $X_i^-$. Is there any system that cannot be modeled with
this new restriction?




$X_1$: Button {up, down}
$X_2$: Interior {up1, up2, down1, down2}
$X_3$: Pen Point {up, down}

$X_1 = U_1$

$X_2$ = up1   if $X_1$ = up ∧ ($X_2^-$ = up1 ∨ $X_2^-$ = down2)
$X_2$ = up2   if $X_1$ = up ∧ ($X_2^-$ = up2 ∨ $X_2^-$ = down1)
$X_2$ = down1   if $X_1$ = down ∧ ($X_2^-$ = down1 ∨ $X_2^-$ = up1)
$X_2$ = down2   if $X_1$ = down ∧ ($X_2^-$ = down2 ∨ $X_2^-$ = up2)

$X_3$ = up   if [condition not recoverable from the source]
$X_3$ = down   otherwise

Figure 3.1: Example of a causal model with memory
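
The four-case function for the interior state in Figure 3.1 can be read as a transition table on the previous interior state, and a few lines of code suffice to step it through time. The sketch below is one such reading, not the report's own implementation. It assumes that the disjunctions in the figure refer to the previous value of the interior variable, and the condition for the pen point (`pen_point`) is an assumption, since that part of the figure is not legible in the source.

```python
def next_interior(button, prev_interior):
    """Interior state as a function of the current button and the previous interior state."""
    if button == "up":
        return "up1" if prev_interior in ("up1", "down2") else "up2"
    # button == "down"
    return "down1" if prev_interior in ("down1", "up1") else "down2"

def pen_point(interior):
    # ASSUMPTION: the point is retracted ("up") in states up1 and down2.
    # The actual condition in Figure 3.1 could not be recovered.
    return "up" if interior in ("up1", "down2") else "down"

def simulate(buttons, start="up1"):
    """Return the (interior, point) pair after each button value in `buttons`."""
    history = []
    interior = start
    for button in buttons:
        interior = next_interior(button, interior)
        history.append((interior, pen_point(interior)))
    return history
```

Calling `simulate(["down", "up", "down", "up"])` walks the interior through down1, up2, down2, and back to up1: one press-and-release flips the resting state of the pen, and a second one flips it back, which is the memory effect that a static causal model cannot express.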




If we do not limit the size of the domain of the variables involved, we can
easily transform any causal model with memory into an equivalent model such
that $pa_i^- = X_i^-$ for all $i$. This can be done by combining variables into
“meta”-variables, each of which has a larger domain. For instance, if a variable $X_4$ has $X_2^-$
in $pa_4^-$, we could combine $X_4$ and $X_2$ into one variable. In the extreme case, all
of the variables in $V$ can be combined into a single variable. An example of such
a combination is given in Figure 3.2. If the variables $X_1, \ldots, X_4$ in Figure 3.2a
are all binary, the variables in the model can be combined to obtain Figure 3.2b,
in which the single variable $X_1$ has $2^4 = 16$ values.

This method of obtaining a causal model with memory such that $pa_i^- = X_i^-$
suffers from two major disadvantages. One disadvantage is that the domains
of the variables grow exponentially. More important, by combining variables in
this way, we lose valuable information about the way the system behaves. We
can no longer access information on how the variables in the system respond to
intervention, which is precisely why we developed causal models in the first place.

We can, however, solve the problem without losing the basic structure of the
causal model. Given any arbitrary causal model with memory, we can create a
new causal model with memory such that for all $X_i \in V$, $pa_i^- = X_i^-$, while
maintaining the information on how the model will behave under interventions.
We do this by increasing the domain of each of the variables to include previous
state information.

Figure 3.3: A causal model with memory which contains variables whose values
depend upon the past history of other variables.

For example, consider the causal model in Figure 3.3, which contains three
binary variables. The functions for $X_2$ and $X_3$ both rely on the past values of
other variables. We can create an equivalent causal model with memory such that
each variable depends only upon its own past value by making all the variables
have four values instead of two, as in Figure 3.4. In this model, the high-order
bit of each variable encodes the “past” value of the variable in the original
model, and the low-order bit encodes the “current” value of the variable in the
original model. In the functions in this example, MOD stands for modular division
(so $X$ MOD 2 extracts the low-order bit of $X$), and DIV stands for integral
division (so $X$ DIV 2 extracts the high-order bit of $X$). Likewise, multiplying by
two is the same as a left shift.

Figure 3.4: A causal model with memory which contains variables whose values
do not depend upon the past histories of other variables

We can make this transformation for any causal model with memory. The
number of bits required to represent each variable may need to double, but the
basic causal structure of the new model will be the same as that of the original
model. Thus, limiting $pa_i^-$ to $X_i^-$ does not limit the expressive power of a causal
model with memory.
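
The encoding can be made concrete with a two-variable sketch. The model below is hypothetical and is not the one in Figures 3.3 and 3.4, whose functions are not reproduced in the text. In it, $X_1$ copies its disturbance and $X_2$ copies the previous value of $X_1$. The encoded version stores each variable as 2 × past + current, so that every function reads only current values and its own variable's previous value.

```python
def step_original(prev, u1):
    """Original model: X2 depends on the PAST value of another variable (X1)."""
    prev_x1, _prev_x2 = prev
    x1 = u1
    x2 = prev_x1
    return (x1, x2)

def step_encoded(prev, u1):
    """Encoded model: each variable has four values, 2 * past + current.

    E1 reads only its own previous value; E2 reads its own previous value
    and the CURRENT value of E1, whose high-order bit carries X1's past.
    """
    prev_e1, prev_e2 = prev
    e1 = 2 * (prev_e1 % 2) + u1          # shift old current bit up, append new bit
    e2 = 2 * (prev_e2 % 2) + (e1 // 2)   # new current bit = past value of X1
    return (e1, e2)

def run(step, start, inputs):
    """Apply `step` once per disturbance value and collect the visited states."""
    states, state = [], start
    for u1 in inputs:
        state = step(state, u1)
        states.append(state)
    return states
```

Taking the low-order bit (`% 2`) of each encoded variable recovers the trajectory of the original model, while the high-order bit (`// 2`) recovers the previous value. The domain of each variable doubles in bits, as the text notes.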

3.3  Time-Series  Causal  Model 

A logical extension of adding a single piece of memory to a causal model is to
consider a time series. We can time-index each of the variables in $V$ and $U$. Thus,
$V$ will stand for a time series, $V_{t=1}, V_{t=2}, \ldots$. Likewise, $U$ will be a time series
$U_{t=1}, U_{t=2}, \ldots$. Each function $f_i$ will map a subset of $V_{t-1} \cup U_t \cup V_t$ to $X_{it}$. Thus,
we will be able to consider how a system changes over time. Formally:

Definition 16 (time-series causal model) A time-series causal model is a 3-tuple

$M = \langle V, U, F \rangle$

where

(i) $V = \{V_{t=1}, V_{t=2}, \ldots\}$ is a time series of variables, where each $V_t =
\{X_{1t}, X_{2t}, \ldots, X_{nt}\}$ is a set of endogenous variables determined within the
system,

(ii) $U = \{U_{t=1}, U_{t=2}, \ldots\}$ is a time series of variables, where each $U_t =
\{U_{1t}, U_{2t}, \ldots, U_{mt}\}$ is a set of exogenous variables that represent disturbances,
abnormalities, assumptions, or boundary conditions, and

(iii) $F = \{f_i\}$ is a set of $n$ deterministic, nontrivial functions, each of the form

$x_{it} = f_i(pa_{it}, pa_{i,t-1}, u_t)$,   $i = 1, \ldots, n$,   $pa_{it} \subseteq V_t \setminus X_{it}$,   $pa_{i,t-1} \subseteq V_{t-1}$

The members of the set $PA_i$ (connoting parents) are often called the direct causes
of $X_i$. We will assume that each variable has the same parent set over time. In
addition, we often will make the assumption that $P(u_t) = P(u_{t+1})$ for all $t$;
this assumption is not necessary, however. We will also assume that the set of
equations in (iii) has a unique solution for $X_{1t}, \ldots, X_{nt}$ given any values of the
disturbances $U_{1t}, \ldots, U_{mt}$ and previous values $V_{t-1}$. Thus, we can consider each
variable $Y$ to be a function of the disturbances $U$ and the previous values of $V$ in
the causal model $M$: $Y_t = Y_{M_t}(u_t, v_{t-1})$.




We can define a probabilistic time-series causal model as follows.

Definition 17 (probabilistic time-series causal model) A probabilistic
time-series causal model is a tuple

$\langle M, P(u) \rangle$

where

(i) $M$ is a time-series causal model $\langle V, U, F \rangle$, and

(ii) $P(u)$ is a probability distribution over $U$ such that each element $U_i \in U$ is
marginally independent of all other elements of $U$.

Time-series causal models bear a strong resemblance to dynamic systems,
particularly the area of discrete iterations. However, there are some significant
differences.

Consider discrete iterations as defined by Robert [Rob86a]:

• $X$ denotes a (usually large) finite set of variables

• $F$ denotes a map of $X$ onto itself

• $x^0$ denotes an initial value for $X$

Starting with the initial value of $X$, the model deals with the sequence of
values defined by

$x^{r+1} = F(x^r)$,   $r = 0, 1, 2, \ldots$

Since $X$ is finite, the sequence must either converge to a value $\xi$ such that
$F(\xi) = \xi$, or converge to a cycle $\xi_0, \ldots, \xi_p$ such that

$\xi_1 = F(\xi_0)$

$\vdots$

$\xi_p = F(\xi_{p-1})$

$\xi_0 = F(\xi_p)$

In the model, there can be a single fixed point $\xi$ such that for any initial value
$x^0$ of $X$ there exists a $p$ such that for all $i \geq p$, $x^i = \xi$, or a single cycle.
Likewise, the model could include several cycles, fixed points, or a mixture of
cycles and fixed points.
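
The claim that a discrete iteration over a finite set must end in a fixed point or a cycle can be seen by simply iterating until a value repeats. The following sketch does exactly that; the function name `iterate_to_cycle` and the example map are illustrative assumptions, not part of Robert's formulation.

```python
def iterate_to_cycle(F, x0):
    """Iterate x^{r+1} = F(x^r) from x0 until a value repeats.

    Returns (transient, cycle): the values visited before the cycle is
    entered, and the cycle itself. A fixed point is a cycle of length 1.
    Termination relies on F mapping a finite set into itself.
    """
    first_seen = {}   # value -> index at which it first appeared
    sequence = []
    x = x0
    while x not in first_seen:
        first_seen[x] = len(sequence)
        sequence.append(x)
        x = F(x)
    start = first_seen[x]
    return sequence[:start], sequence[start:]

# Hypothetical map on {0, ..., 5} with one 3-cycle and one fixed point.
EXAMPLE_MAP = {0: 1, 1: 2, 2: 3, 3: 1, 4: 4, 5: 4}

def example_F(x):
    return EXAMPLE_MAP[x]
```

With this example map, starting from 0 gives the transient `[0]` followed by the cycle `[1, 2, 3]`, while starting from 5 gives the transient `[5]` followed by the fixed point `[4]`, so a single map can mix cycles and fixed points as described above.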

At first glance, the restriction on time-series causal models that each set of
equations $F$ must have a unique value $x_t$ for each value $u$ of $U$ and $x_{t-1}$ of $X_{t-1}$
seems overly strong. Indeed, this restriction seems similar to insisting
that a discrete iteration system have a single fixed point. However, the restriction
is actually more akin to requiring that the function $F$ in the theory of discrete
iteration systems have a unique value for every element $x$ of $X$.

Consider for a moment what relaxing the restriction would mean for causal
models. Given a set of boundary conditions and the previous value for each
variable, there would be multiple possible values for the next time step. This
would correspond to a nondeterministic world in which the nondeterminism could
not be fully characterized or described. Now, consider how a similar situation
plays out with the restriction in place. Let $x_{t-1}$ and $u$ be the values for which
there could be more than one value of $X_t$. We can add an additional variable
$U_j$ to $U$ which determines which of the possible values $x_t$ will result from $u$ and
$x_{t-1}$. With the restriction, then, we can describe the behavior of the system
more completely than we could with the restriction relaxed; we do not have to
leave part of the causal mechanism undefined.

This example illustrates one of the significant differences between time-series
causal models and discrete iteration systems, that is, a time-series causal model
has boundary conditions, expressed by the exogenous variables $U$, while a discrete
iteration system does not.

3.4  Conclusion 

In this chapter, we proposed causal models with memory, an extension to causal
models that implements the concept of previous state. This extension allows us
to expand the range of systems that can be modeled using causal models. We
also proposed a time-series causal model. This model acts as a bridge between
full dynamic systems and causal models: it combines the expressiveness of the
former with the compactness and ease of computation of the latter. Thus, the
benefits of each formalism can be combined into one system. Consider modeling
an economic system, where the researcher is interested in how changing interest
rates over time will affect the price of a commodity. If the changes in interest rate
are slow compared to the fluctuations of the price and quantity of the commodity,
then the researcher will be able to combine the standard static equilibrium supply
and demand equations with more dynamic elements. Thus, the researcher need
only model the dynamic behavior of the elements for which the transient effects
are important, without giving up the power to model such dynamic systems.




CHAPTER  4 


Causal  Relevance 


4.1  Introduction 

Geiger, Verma, and Pearl [GVP90a] have developed a set of axioms for a class of
relations called graphoids. These axioms characterize informational relevance*
among observed events based on the semantics of conditional independence in
probability calculus. This chapter develops a parallel set of axioms for causal
relevance, that is, the tendency of certain events to affect the occurrence of other
events in the physical world, independent of the observer-reasoner. Informational
irrelevance is concerned with statements of the form “X is conditionally
independent of Y given Z,” which means that, given the value of Z, gaining information
about X gives us no new information about Y. Causal irrelevance is concerned
with statements of the form “X is causally irrelevant to Y given Z,” which we
take to mean “Changing X will not alter the value of Y, if Z is fixed.”

The  notion  of  causal  relevance  has  its  roots  in  the  philosophical  works  of  Good 
[G006I],  Suppes  [Sup70]  and  Salmon  [Sal84].  They  have  attempted  to  give  prob¬ 
abilistic  interpretations  to  cause-effect  relationships,  and  to  distinguish  causal 
from  statistical  relevance.  Although  their  attempts  have  not  produced  an  algo- 

^The  term  “relevance”  will  be  used  primarily  as  a  generic  name  for  the  relationship  of  being 
relevant  or  irrelevant.  It  will  be  clear  from  the  context  when  “relevance”  is  intended  to  negate 
“irrelevance.” 


61 


rithmic  definition  of  causal  relevance,  they  did  generate  methods  for  testing  the 
consistency  of  relevance  statements  against  a  given  probability  distribution  and  a 
given  temporal  ordering  among  the  variables  [Car89,  Eel91,  Pea96b].  This  chap¬ 
ter  aims  at  axiomatizing  relevance  statements  in  themselves,  without  reference 
to  underlying  probabilities  or  temporal  orderings. 

Axiomatic characterization of causal relevance may serve as a normative
standard for theories of action as well as a guide for developing representation schemes
(e.g., graphical models) for planning and decision-making applications. For
example, instead of explicitly storing all possible effects of an action, as in STRIPS
[FN72], such representation schemes should enable an agent to examine only
direct effects of actions and to infer which actions are relevant for a given goal, and
which cease to be relevant once other actions are implemented.

An axiomatization of causal relevance could also be useful to experimental
researchers in domains where exact causal models do not exist. If we know, through
experimentation, that some variables have no causal influence on others in a
system, we may wish to determine whether other variables can gain such causal
influence under different experimental conditions, or we may want to discover
what additional experiments could provide such information. For example,
suppose we find that a rat’s diet has no effect on tumor growth while the amount
of exercise is kept constant and, conversely, that exercise has no effect on
tumor growth while diet is kept constant. We would like to be able to infer that
controlling only diet (while paying no attention to exercise) would still have no
influence on tumor growth. A more subtle inference problem is whether
changing cage temperature could have an effect on the rat’s physical activity, having
established that temperature has no effect on activity when diet is kept constant
and that temperature has no effect on (the rat’s choice of) diet when activity is
kept constant.

We provide two formal definitions of causal irrelevance, a probabilistic
definition and a deterministic definition. The probabilistic definition, which equates
causal irrelevance with inability to change the probability of the effect variable,
has intuitive appeal but is inferentially very weak; it does not support a very
expressive set of axioms unless further assumptions are made about the underlying
causal model. If we add the stability assumption (i.e., that no irrelevance can be
destroyed by changing the nature of the individual processes in the system), then
we obtain a set of axioms for probabilistic causal irrelevance that is the same as
the set governing path-interception in directed graphs. The deterministic
definition, which equates causal irrelevance with inability to change the effect variable
(in any state of the world), allows for a richer set of axioms without any
assumptions about the causal model being made. All of the path-interception axioms for
directed graphs, with the exception of transitivity, hold for deterministic causal
irrelevance.

4.2  Probabilistic Causal Irrelevance

The existence of a probability distribution over all of the variables in a causal
model (Eq. (2.4)) leads to a natural definition of a probabilistic version of causal
irrelevance.

Definition 18 (probabilistic causal irrelevance) X is probabilistically causally
irrelevant to Y, given Z, written $(X \nrightarrow Y \mid Z)_P$, iff

$$\forall x, x', y, z \quad P(y \mid \hat{z}, \hat{x}) = P(y \mid \hat{z}, \hat{x}')$$

Read: “Once we hold Z fixed (at z), changing X between any two values will not
affect the probability of Y.”
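Definition 18 can be checked by brute force on a small model. The following is a minimal sketch, assuming binary variables and a recursive model given as an ordered list of equations; the helper names `prob_one` and `is_irrelevant` and the example chain X → Z → Y are illustrative assumptions, not part of the report.

```python
from itertools import product

def prob_one(equations, p_u, target, do):
    """Return P(target = 1 | set(do)) in a recursive model over binary variables.

    equations: list of (name, f) in causal order; f(v, u) computes the variable
               from already-computed values v and the exogenous state u.
    p_u:       dict mapping each exogenous state u to its probability.
    do:        dict {name: forced_value} describing the intervention.
    """
    total = 0.0
    for u, p in p_u.items():
        v = {}
        for name, f in equations:
            # An intervened variable ignores its equation and takes the forced value.
            v[name] = do[name] if name in do else f(v, u)
        total += p * v[target]
    return total

def is_irrelevant(equations, p_u, xs, y, zs):
    """Definition 18 for a single binary Y: for every z, P(y | set(z), set(x))
    must be the same for all settings x of the variables in xs."""
    for z_vals in product([0, 1], repeat=len(zs)):
        probs = []
        for x_vals in product([0, 1], repeat=len(xs)):
            do = dict(zip(zs, z_vals))
            do.update(zip(xs, x_vals))
            probs.append(prob_one(equations, p_u, y, do))
        if max(probs) - min(probs) > 1e-12:
            return False
    return True

# Hypothetical chain X -> Z -> Y with two fair exogenous bits (u[0], u[1]).
chain = [
    ("X", lambda v, u: u[0]),
    ("Z", lambda v, u: v["X"] | u[1]),
    ("Y", lambda v, u: v["Z"]),
]
p_u = {u: 0.25 for u in product([0, 1], repeat=2)}

x_irrelevant_given_z = is_irrelevant(chain, p_u, ["X"], "Y", ["Z"])
x_irrelevant_alone = is_irrelevant(chain, p_u, ["X"], "Y", [])
```

In this assumed chain, fixing Z screens Y off from interventions on X, while with Z left free, setting X changes the probability of Y.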

4.2.1  Comparison  to  Informational  Relevance 

If we remove the hats from Definition 18, we get the standard definition of
conditional independence in probability calculus, denoted $(X \perp Y \mid Z)$, which is
governed by the graphoid axioms [PP87, GVP90b] given in Figure 4.1. Dawid
[Daw79] and Spohn [Spo80] introduced different forms of these axioms, and Pearl
and Paz [PP87] conjectured that these axioms were complete. Studeny [Stu92]
refuted this conjecture and proved that conditional independence in probability
theory has no finite axiomatization. Nevertheless, the graphoid axioms
capture the most important features of informational relevance: “Learning
irrelevant information should not alter the relevance status of other propositions in
the system; what was relevant remains relevant, and what was irrelevant remains
irrelevant” [Pea88].

One of the salient differences between informational and causal relevance is the
property of symmetry, axiom 1.1. Informational relevance is symmetric, namely,
if X is relevant to Y, then Y is relevant to X as well. For example, learning
whether the sprinkler is on provides information on whether the pavement is wet,
and, vice versa, learning whether the pavement is wet provides information on
whether the sprinkler is on. This property is clearly violated in causal models:
turning a sprinkler on tends to make the pavement wet, so turning on the sprinkler
gives us information about the state of the pavement; conversely, wetting the
pavement has no physical effect on the state of the sprinkler and gives us no
information about whether the sprinkler was on or off.

1.1 (Symmetry) $(X \perp Y \mid Z) \Longrightarrow (Y \perp X \mid Z)$

1.2 (Decomposition) $(X \perp YW \mid Z) \Longrightarrow (X \perp Y \mid Z)$

1.3 (Weak Union) $(X \perp YW \mid Z) \Longrightarrow (X \perp Y \mid ZW)$

1.4 (Contraction) $(X \perp Y \mid Z) \;\&\; (X \perp W \mid ZY) \Longrightarrow (X \perp YW \mid Z)$

1.5 (Intersection) $(X \perp W \mid ZY) \;\&\; (X \perp Y \mid ZW) \Longrightarrow (X \perp YW \mid Z)$

Intersection requires a strictly positive probability distribution.

Figure 4.1: The graphoid axioms

Another basic difference between informational and causal relevance is that
in the former, the rule of the hypothetical middle [Pea88, p. 17] always holds:

$$\min_x P(y \mid x) \le P(y) \le \max_x P(y \mid x) \qquad (4.1)$$

In causal relevance, $P(y)$ might be greater than $\max_x P(y \mid \hat{x})$ or less than
$\min_x P(y \mid \hat{x})$. Figure 4.2 illustrates such a possibility.

In Figure 4.2, there are two endogenous variables X and Y, as well as an
exogenous variable $U_1$. Without any intervention, X will always have the same
value as $U_1$, and thus, Y will have the value 1. If X and $U_1$ have different
values, however, then Y will have the value 0. If we intervene and set X = 1,
then Y will have the value 1 when $U_1 = 1$, which has a probability 0.5, and
Y will have the value 0 when $U_1 = 0$, which has a probability 0.5:
$P(Y = 0 \mid set(X = 1)) = P(Y = 1 \mid set(X = 1)) = 0.5$. Similarly, we can see that
$P(Y = 0 \mid set(X = 0)) = P(Y = 1 \mid set(X = 0)) = 0.5$. Thus, $\max_x P(y \mid \hat{x}) = 0.5$,
and $P(Y = 1) = 1 > 0.5 = \max_x P(y \mid \hat{x})$.

[Figure 4.2 specifies the following model:]
V = {X, Y} binary
U = {$U_1$} binary
$x = u_1$
$y = 1$ if $x = u_1$, and $y = 0$ otherwise
$P(u_1) = 0.5$

Figure 4.2: Example of $P(y) > \max_x P(y \mid \hat{x})$
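The arithmetic of the Figure 4.2 example can be reproduced by enumerating the two states of $U_1$. This is a small sketch using the equations stated for that figure; the function and variable names are illustrative.

```python
# Model of Figure 4.2: X = U1, Y = 1 if X == U1 else 0, P(U1 = 1) = 0.5.
P_U1 = {0: 0.5, 1: 0.5}

def y_of(x, u1):
    """Structural equation for Y."""
    return 1 if x == u1 else 0

# Without intervention X follows U1, so Y is always 1.
p_y1_observed = sum(p * y_of(u1, u1) for u1, p in P_U1.items())

# Under set(X = x), X no longer tracks U1, so Y = 1 only when U1 happens to equal x.
p_y1_set_x = {
    x: sum(p * y_of(x, u1) for u1, p in P_U1.items())
    for x in (0, 1)
}

violates_hypothetical_middle = p_y1_observed > max(p_y1_set_x.values())
```

Here `p_y1_observed` is 1.0 while both interventional probabilities are 0.5, so $P(y)$ lies outside the range spanned by $P(y \mid \hat{x})$.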

Note that, given this violation of the rule of the hypothetical middle (Eq.
(4.1)), Definition 18 is not equivalent to

$$\forall x, y, z \quad P(y \mid \hat{z}, \hat{x}) = P(y \mid \hat{z}) \qquad (4.2)$$

Read: “Once we hold Z fixed (at z), controlling X will not affect the probability
of Y.” In fact, Eq. (4.2) is stronger than Definition 18; furthermore, statement
2.5.2 (left-intersection of Theorem 6, below) follows from the former but not from
the latter.

The notion of probabilistic causal irrelevance may bring to mind the concept
of ignorability [RR83], which is extremely important in analyzing the effectiveness
of treatments (e.g., drugs, diet, educational programs) from uncontrolled studies.
The two concepts are related but different. Ignorability allows us to ignore how X
obtained its value x, while irrelevance allows us to ignore which value X actually
obtained. Ignorability is defined as the condition

$$P(Y_x = y \mid z) = P(Y_x = y \mid z, x) \qquad (4.3)$$

which implies

$$P(y \mid \hat{x}) = P(Y_x = y) = \sum_z P(y \mid z, x)\, P(z) \qquad (4.4)$$

Thus, ignorability allows an investigator to relate the potential response $Y_x$ to
observable conditional probabilities. Central in experimental design is the
question of how to select a set of observables Z that would make Eq. (4.3) true, given
causal knowledge of the domain. Ignorability in itself does not provide such a
criterion, although it does state the problem in formal counterfactual language:
“Z can be selected if, for every x, the value that Y would obtain had X been x
is conditionally independent of X, given Z.” A criterion for selecting Z can be
obtained from the graph G(M) underlying a causal model (e.g., the back-door
criterion in [Pea95a]).
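The relation between the potential response and observable conditionals can be illustrated numerically. The sketch below assumes a hypothetical three-variable model (Z influences both X and Y, X influences Y, independent exogenous bits) in which Z satisfies the back-door condition; the model, its probabilities, and the function names are assumptions chosen for illustration. It compares $P(y \mid \hat{x})$ computed by intervention with the adjustment sum $\sum_z P(y \mid z, x) P(z)$ and with the plain conditional $P(y \mid x)$.

```python
from fractions import Fraction
from itertools import product

# Hypothetical exogenous distributions (independent binary variables).
P_UZ = {0: Fraction(7, 10), 1: Fraction(3, 10)}
P_UX = {0: Fraction(4, 5), 1: Fraction(1, 5)}
P_UY = {0: Fraction(9, 10), 1: Fraction(1, 10)}

def solve(uz, ux, uy, do_x=None):
    """Solve the assumed model; do_x forces X, otherwise X = Z xor UX."""
    z = uz
    x = (z ^ ux) if do_x is None else do_x
    y = (x & z) | uy
    return x, y, z

def states():
    """Yield every exogenous state with its probability."""
    for uz, ux, uy in product((0, 1), repeat=3):
        yield (uz, ux, uy), P_UZ[uz] * P_UX[ux] * P_UY[uy]

def observed_joint():
    """Observational joint distribution over (x, y, z)."""
    joint = {}
    for u, p in states():
        key = solve(*u)
        joint[key] = joint.get(key, Fraction(0)) + p
    return joint

def p_do(y, x):
    """P(Y = y | set(X = x)), computed in the mutilated model."""
    return sum((p for u, p in states() if solve(*u, do_x=x)[1] == y), Fraction(0))

def p_adjusted(y, x):
    """Adjustment sum over z of P(y | z, x) P(z), using observational data only."""
    joint = observed_joint()
    total = Fraction(0)
    for z in (0, 1):
        p_z = sum(p for (_, _, zz), p in joint.items() if zz == z)
        p_xz = sum(p for (xx, _, zz), p in joint.items() if xx == x and zz == z)
        total += joint.get((x, y, z), Fraction(0)) / p_xz * p_z
    return total

def p_conditional(y, x):
    """Plain conditional P(Y = y | X = x), which ignores confounding by Z."""
    joint = observed_joint()
    p_x = sum(p for (xx, _, _), p in joint.items() if xx == x)
    p_xy = sum(p for (xx, yy, _), p in joint.items() if xx == x and yy == y)
    return p_xy / p_x
```

In this assumed model the adjusted quantity agrees with the interventional one for every x and y, while the unadjusted conditional does not, which is the practical content of choosing a suitable Z.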

The question we attempt to answer in this section is whether the relation
of causal irrelevance, $(A \nrightarrow B \mid C)_P$, is governed by a set of axioms similar to
the set of axioms governing the relation of informational irrelevance, $(A \perp B \mid C)$.
More generally, one may ask whether there are any constraints that prohibit the
assignment of arbitrary functions $P(y \mid \hat{x})$ to any pair (X, Y) of variable sets in V,
in total disregard of the fact that $P(y \mid \hat{x})$ represents the probability of (Y = y)
induced by physically setting X = x in some causal model M. Our findings
indicate that, although it is not totally arbitrary, the assignment $P(y \mid \hat{x})$ is only
weakly constrained by qualitative axioms such as those in Figure 4.1.

4.2.2  Axioms  of  Probabilistic  Causal  Irrelevance 

We  have  found  only  two  qualitative  properties  that  constrain  probabilistic  causal 
irrelevance. 

Theorem 6 For any causal model, the following two properties must hold:

2.2.1 (Right-Decomposition) $(X \nrightarrow YW \mid Z)_P \Longrightarrow (X \nrightarrow Y \mid Z)_P \;\&\; (X \nrightarrow W \mid Z)_P$

2.5.2 (Left-Intersection) $(X \nrightarrow Y \mid ZW)_P \;\&\; (W \nrightarrow Y \mid ZX)_P \Longrightarrow (XW \nrightarrow Y \mid Z)_P$

Property 2.2.1 reads: “If changing X has no effect on Y and W considered
jointly, then it has no effect on either Y or W considered separately.” This follows
trivially from the fact that P(·) is a probability function, but it does not reflect
any quality of causation.

Property 2.5.2 reads: “If changing X cannot affect P(y) when W is fixed,
and changing W cannot affect P(y) when X is fixed, then changing X and W
together cannot affect P(y).”

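Many seemingly intuitive properties do not hold. For instance, none of the
following statements holds for all causal models.

2.2.2 (Left-Decomposition-1) $(XW \nrightarrow Y \mid Z)_P \Longrightarrow (X \nrightarrow Y \mid Z)_P \lor (W \nrightarrow Y \mid Z)_P$

2.2.3 (Left-Decomposition-2) $(XW \nrightarrow Y \mid Z)_P \Longrightarrow (X \nrightarrow Y \mid Z)_P \lor (X \nrightarrow W \mid Z)_P$

2.2.4 (Left-Decomposition-3) $(XW \nrightarrow Y \mid Z)_P \land (XY \nrightarrow W \mid Z)_P \Longrightarrow (X \nrightarrow Y \mid Z)_P \lor (X \nrightarrow W \mid Z)_P$

2.3 (Weak Union) $(X \nrightarrow WY \mid Z)_P \Longrightarrow (X \nrightarrow Y \mid ZW)_P$

2.4 (Contraction) $(X \nrightarrow Y \mid Z)_P \land (X \nrightarrow W \mid ZY)_P \Longrightarrow (X \nrightarrow WY \mid Z)_P$

2.5.1 (Right-Intersection) $(X \nrightarrow Y \mid ZW)_P \land (X \nrightarrow W \mid ZY)_P \Longrightarrow (X \nrightarrow WY \mid Z)_P$

2.6 (Transitivity) $(X \nrightarrow Y \mid Z)_P \Longrightarrow (a \nrightarrow Y \mid Z)_P \lor (X \nrightarrow a \mid Z)_P \quad \forall a \notin X \cup Z \cup Y$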

The sentences above were tailored after the graphoid axioms (Figure 4.1) with
the provision that symmetry does not hold, thus requiring left and right versions.
Many of these sentences have intuitive appeal and yet are not sound relative to
the semantics of $P(y \mid \hat{x})$. For example, left-decomposition states that if changing
X has an effect on Y, and changing W has an effect on Y, then changing X
and W simultaneously should also affect Y. It is hard to find a simple real-life
example that refutes this assertion. Still, as will be shown in the examples of
Section 4.3.1 and in Appendix A, each of these sentences is refuted by some
specific causal model.

4.3  Proofs  of  Axioms  of  Probabilistic  Causal  Irrelevance 

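We now prove the two sentences of Theorem 6.

2.2.1 Holds trivially. By Definition 18, $(X \nrightarrow YW \mid Z)_P$ implies
$P(y, w \mid \hat{z}, \hat{x}) = P(y, w \mid \hat{z}, \hat{x}')$ for all $x, x'$. We can sum over W to get
$P(y \mid \hat{z}, \hat{x}) = P(y \mid \hat{z}, \hat{x}')$, which implies $(X \nrightarrow Y \mid Z)_P$. The case
for $(X \nrightarrow W \mid Z)_P$ is symmetric. □

2.5.2 (By contradiction) Assume $(X \nrightarrow Y \mid ZW)_P \land (W \nrightarrow Y \mid ZX)_P \land \neg(XW \nrightarrow Y \mid Z)_P$.
Since $\neg(XW \nrightarrow Y \mid Z)_P$, by definition
$\exists y, x, x', w, w', z \;\; P(y \mid \hat{z}, \hat{w}, \hat{x}) \ne P(y \mid \hat{z}, \hat{w}', \hat{x}')$. However,
$(X \nrightarrow Y \mid ZW)_P$ implies $\forall y, x, x', z, w \;\; P(y \mid \hat{z}, \hat{x}, \hat{w}) = P(y \mid \hat{z}, \hat{x}', \hat{w})$.
Furthermore, $(W \nrightarrow Y \mid ZX)_P$ implies
$\forall y, x', w, w', z \;\; P(y \mid \hat{z}, \hat{x}', \hat{w}) = P(y \mid \hat{z}, \hat{x}', \hat{w}')$.
So, $\forall x, x', w, w', z \;\; P(y \mid \hat{z}, \hat{x}, \hat{w}) = P(y \mid \hat{z}, \hat{x}', \hat{w}) = P(y \mid \hat{z}, \hat{x}', \hat{w}')$. Thus
$\forall x, x', w, w', z \;\; P(y \mid \hat{z}, \hat{x}, \hat{w}) = P(y \mid \hat{z}, \hat{x}', \hat{w}')$, which contradicts
$\exists x, x', w, w', z \;\; P(y \mid \hat{z}, \hat{x}, \hat{w}) \ne P(y \mid \hat{z}, \hat{x}', \hat{w}')$. □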




4.3.1  Counterexample  to  Property  2.2.2 

We now disprove property 2.2.2 by counterexample. This counterexample is not
necessarily meant to model a common, real-life situation. Rather, it disproves
the claim that all possible causal models conform to the property.

2.2.2 $(XW \nrightarrow Y \mid Z)_P \Longrightarrow (X \nrightarrow Y \mid Z)_P \lor (W \nrightarrow Y \mid Z)_P$

Figure 4.3 shows a counterexample to this sentence. In this model,
$(XW \nrightarrow Y \mid \emptyset)_P \;\&\; \neg(X \nrightarrow Y \mid \emptyset)_P \;\&\; \neg(W \nrightarrow Y \mid \emptyset)_P$. The contrapositive form
of this counterexample states that changing W can affect the probability
of Y, and changing X can affect the probability of Y, but changing W and
X simultaneously has no effect on the probability of Y. This is extremely
counterintuitive; if tweaking X has an effect on Y, and tweaking W has an
effect on Y, we would expect the more flexible option of changing X and
W simultaneously to also affect Y. The key to this counterexample is the
fact that setting W removes the connection between W and $U_1$. When we
intervene on only X, W takes on the same value as $U_1$, and Y will always
have the value of X. When we intervene on both X and W, there is no
longer any connection between $U_1$ and W. Thus, the probability that W
and $U_1$ will have the same value is 0.5, and P(y) = 0.5.
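A model with the behavior described above can be written down and checked by enumeration. The equations below are a hypothetical reconstruction chosen only to match the verbal description (W copies $U_1$, Y equals X whenever W agrees with $U_1$); they are an assumption and are not claimed to be the model drawn in Figure 4.3.

```python
# Hypothetical model (an assumption, not necessarily that of Figure 4.3):
#   X = U1,  W = U1,  Y = X xor W xor U1,  with P(U1 = 1) = 0.5.
P_U1 = {0: 0.5, 1: 0.5}

def p_y1(do):
    """P(Y = 1 | set(do)), where do maps a subset of {"X", "W"} to forced values."""
    total = 0.0
    for u1, p in P_U1.items():
        x = do.get("X", u1)   # X follows U1 unless forced
        w = do.get("W", u1)   # W follows U1 unless forced
        y = x ^ w ^ u1        # Y equals X exactly when W agrees with U1
        total += p * y
    return total

# Setting X alone or W alone moves the probability of Y ...
effect_of_x = [p_y1({"X": x}) for x in (0, 1)]
effect_of_w = [p_y1({"W": w}) for w in (0, 1)]

# ... but setting both together leaves it at 0.5 for every combination.
effect_of_xw = [p_y1({"X": x, "W": w}) for x in (0, 1) for w in (0, 1)]
```

Under these assumed equations, X and W are each probabilistically relevant to Y, yet the pair XW is irrelevant, which is the pattern that refutes property 2.2.2.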

Counterexamples  to  the  other  six  properties  that  do  not  hold  for  all  causal 
models  are  in  Appendix  A. 

4.3.2  Numeric  Constraints 

Although Definition 18 imposes only weak constraints (axioms 2.2.1 and 2.5.2)
on the structure of probabilistic causal irrelevance, the probability assignments
$P(y \mid \hat{x})$, which describe the effects of actions in the domain, are
nevertheless constrained by nontrivial numerical bounds. For instance, the inequality

$$P(y \mid \hat{x}, \hat{z}) \ge P(y, z \mid \hat{x}) \qquad (4.5)$$

must hold in any causal model. This can easily be shown by the definitions of
$P(y, z \mid \hat{x})$ and $P(y \mid \hat{x}, \hat{z})$. Recall from Eq. (2.4) that

$$P(y, z \mid \hat{x}) = \sum_{\{u \,\mid\, Y_x(u) = y \;\&\; Z_x(u) = z\}} P(u)$$

and

$$P(y \mid \hat{x}, \hat{z}) = \sum_{\{u \,\mid\, Y_{xz}(u) = y\}} P(u)$$

Consider $\mathcal{U}_1$, the set of all values u of U such that $Y_x(u) = y$ and $Z_x(u) = z$,
and $\mathcal{U}_2$, the set of all values u' of U such that $Y_{xz}(u') = y$. Since all values
u of $\mathcal{U}_1$ already constrain Z to have the value z, fixing Z at z will not affect
the value of Y. Thus, for all values u of $\mathcal{U}_1$, $Y_{xz}(u) = y$. Hence, $\mathcal{U}_2 \supseteq \mathcal{U}_1$
and $P(y \mid \hat{x}, \hat{z}) \ge P(y, z \mid \hat{x})$. This can be shown more formally using Theorem 1
(Theorem 1 is proven in Section 2.7). Additional constraints were explored in
[Pea95b].
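The bound of Eq. (4.5) can be spot-checked on randomly generated models. The sketch below assumes a small recursive model with binary X, Z, Y, equations Z = f_Z(X, U) and Y = f_Y(X, Z, U) drawn at random, and a random rational distribution over a four-valued U; this model family and the name `check_bound` are illustrative assumptions, and a passing check is evidence for the bound on these models, not a proof.

```python
import random
from fractions import Fraction

def check_bound(seed, n_u=4):
    """Return True if P(y | set(x), set(z)) >= P(y, z | set(x)) for all x, y, z
    in one randomly drawn recursive model X -> Z -> Y with X -> Y."""
    rng = random.Random(seed)
    # Random exogenous distribution with exact rational probabilities.
    weights = [rng.randint(1, 5) for _ in range(n_u)]
    p_u = [Fraction(w, sum(weights)) for w in weights]
    # Random structural equations given as lookup tables.
    f_z = {(x, u): rng.randint(0, 1) for x in (0, 1) for u in range(n_u)}
    f_y = {(x, z, u): rng.randint(0, 1)
           for x in (0, 1) for z in (0, 1) for u in range(n_u)}
    for x in (0, 1):
        for y in (0, 1):
            for z in (0, 1):
                # States u in which, after setting X = x, Z takes value z and Y takes y.
                p_joint = sum((p_u[u] for u in range(n_u)
                               if f_z[x, u] == z and f_y[x, f_z[x, u], u] == y),
                              Fraction(0))
                # States u in which, after setting X = x and Z = z, Y takes value y.
                p_both_set = sum((p_u[u] for u in range(n_u)
                                  if f_y[x, z, u] == y),
                                 Fraction(0))
                if p_both_set < p_joint:
                    return False
    return True

all_hold = all(check_bound(seed) for seed in range(200))
```

The check mirrors the set-inclusion argument: every state counted in the joint probability is also counted once Z is fixed at the value it would have taken anyway.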




4.3.3  Axioms  of  Causal  Relevance  for  Stable  Models 


The set of axioms we obtained for causal irrelevance is much smaller than we
would expect from our intuition about cause-effect relations. We have two
explanations for this discrepancy. One possibility is that our intuition of causal
relevance is based on a deterministic rather than a probabilistic conception of
physical reality. This possibility will be explored in Section 4.4, where we give a
deterministic definition of causal irrelevance that yields a more complete set of
axioms. The other possibility is that the type of examples exploited in Section
4.3.1 and Appendix A are not commonly observed in everyday life. In this section,
we explore what assumptions need to be made for probabilistic causal irrelevance
to acquire properties that we intuitively associate with causal irrelevance.

A more expressive set of causal relevance axioms is obtained if we confine
the analysis to stable causal models, that is, to causal models whose irrelevances
are implied by the structure of the causal model and, hence, remain invariant to
changes in the forms of each individual function $f_i$. Our definition of stability
employs the concept of a replacement class. A replacement class $\tau$ is the set of all
models that have the same variables V and U, and the same functional arguments.
In other words, the functions are allowed to change between members of $\tau$, but
the arguments of these functions are not allowed to vary. Formally, for any two
models $M_1, M_2 \in \tau$ and any two functions $f_i(PA_i) \in M_1$ and $f_i'(PA_i') \in M_2$,
$PA_i = PA_i'$. The class $\tau(M)$ represents the replacement class that contains the
model M.

We now define stability using replacement classes (see also [PV91]²).

Definition 19 (stability) Let M be a causal model. An irrelevance $(X \nrightarrow Y \mid Z)_P$
in M is stable if it is shared by all models in $\tau(M)$. The model M is stable if all
of the irrelevances in M are stable.

²The probabilistic notion of stability (also called “DAG-isomorphism,” “nondegeneracy”
[Pea88, p. 391], and “faithfulness” [SGS93]) was used by Pearl and Verma [1991] to emphasize
the invariance of certain independencies to functional form.

Stability requires that irrelevance be determined by the structure of the
equations, not merely by the parameters of the functions. Thus, a causal model is not
stable if we can remove an irrelevance relationship by replacing an equation or
set of equations to obtain a new model with fewer irrelevance statements. In each
of the examples in Section 4.3.1 and Appendix A, for instance, a minor change
in the form of one of the equations would destroy an irrelevance. None of the
models presented in Figure 4.3 and the appendix is stable.

There are, however, many stable causal models. All monotonic linear systems,
for example, are stable. One might think that any causal model that contains
only additive, monotonic functions $f_i$ would be stable. The causal model of Figure
A.7 refutes that conjecture.

Definition 20 (path-interception) Let $(X \nrightarrow Y \mid Z)_G$ stand for the statement
“Every directed path from X to Y in graph G contains at least one element in Z.”
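Path-interception in the sense of Definition 20 is a reachability question: Z intercepts every directed path from X to Y exactly when no node of Y can be reached from X once the nodes of Z are removed. The sketch below assumes the graph is given as a dictionary from each node to its list of children and that X, Y, and Z are disjoint; the function name and the example graph are illustrative.

```python
from collections import deque

def intercepts(graph, xs, ys, zs):
    """Return True iff every directed path from a node in xs to a node in ys
    contains at least one node in zs (xs, ys, zs assumed pairwise disjoint)."""
    blocked = set(zs)
    targets = set(ys)
    seen = set(xs)
    queue = deque(xs)
    while queue:
        node = queue.popleft()
        for child in graph.get(node, ()):
            if child in blocked or child in seen:
                continue          # paths through Z are intercepted; skip revisits
            if child in targets:
                return False      # found a directed path that avoids Z
            seen.add(child)
            queue.append(child)
    return True

# Hypothetical graph: X -> W -> Y and X -> Z -> Y.
example_graph = {"X": ["W", "Z"], "W": ["Y"], "Z": ["Y"]}

only_z = intercepts(example_graph, ["X"], ["Y"], ["Z"])
z_and_w = intercepts(example_graph, ["X"], ["Y"], ["Z", "W"])
reverse = intercepts(example_graph, ["Y"], ["X"], [])
```

In the example, Z alone does not intercept the path through W, the pair {Z, W} does, and the reverse statement holds vacuously because there is no directed path from Y to X.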

Theorem 7 If a causal model M is stable, then X is probabilistically causally
irrelevant to Y, given Z, in M iff Z intercepts all directed paths from X to Y in
the graph G(M) defined by M. That is,

$$(X \nrightarrow Y \mid Z)_P \iff (X \nrightarrow Y \mid Z)_{G(M)}$$

Proof:

(i) $(X \nrightarrow Y \mid Z)_P \Longrightarrow (X \nrightarrow Y \mid Z)_{G(M)}$

Assume that there exists a stable causal model M that induces a probabilistic
causal irrelevance relation $(A \nrightarrow B \mid C)_P$, and assume that, for some sets of
variables X, Y, Z, $(X \nrightarrow Y \mid Z)_P$ and $\neg(X \nrightarrow Y \mid Z)_{G(M)}$. Since there is a directed path
from X to Y that is not intercepted by Z in G(M), we can easily construct a
model M' such that G(M') = G(M) and $\neg(X \nrightarrow Y \mid Z)_P$ in M'. We can do this
by changing all of the functions that lie on the path from X to Y to disjunctions
and then modifying the other functions to ensure that $P(y \mid \hat{z}) < 1$. Thus, if we
force X to have the value 1, Y will also have the value 1, and $P(y \mid \hat{z}, \hat{x}) \ne P(y \mid \hat{z})$.
By assumption, $(X \nrightarrow Y \mid Z)_P$ holds in M, so an irrelevance in M is not shared by a member
of $\tau(M)$. Thus, M is not a stable causal model, a contradiction.

(ii) $(X \nrightarrow Y \mid Z)_{G(M)} \Longrightarrow (X \nrightarrow Y \mid Z)_P$

We will use the following lemma:

Lemma 1 For any structural equation $f_Y$ in a causal model M, if a series of
functional substitutions results in a new function $g_Y$ such that X is an argument
of $g_Y$, then there must be a directed path from X to Y in G(M).

We will prove this lemma by induction on the number of functional substitutions.

Base Case: If we make no substitutions into $f_Y$, then every argument X of $f_Y$
must be a parent of Y in G(M), by our definition of G(M). Thus, there is a
directed path from each argument of $f_Y$ to Y in G(M).

Inductive Case: Assume that n - 1 functional substitutions into $f_Y$ always
result in a new function $g_Y$ such that for each argument X of $g_Y$, there is a
directed path from X to Y in G(M). We use this assumption to prove that after
n substitutions resulting in $g'_Y$, there is a directed path from every argument of
$g'_Y$ to Y in G(M), as follows: When we do a single substitution, we replace a
variable with a function of its parents in G(M). So, for any new argument X'
that is introduced into $g'_Y$ by substituting in for X, X' must be a parent of X in
G(M). By the inductive hypothesis, there must be a directed path from X to Y
in G(M). Thus, there must be a directed path from X' to Y in G(M).

We can now prove the implication $(X \nrightarrow Y \mid Z)_{G(M)} \Longrightarrow (X \nrightarrow Y \mid Z)_P$. We will
consider $f_Y$, the functional equation for Y in $M_z$. After we do a functional
substitution for all variables in $f_Y$ except X and Z, we are left with a new function
$g_Y$. By Lemma 1, since there is no directed path from X to Y in $G(M_z)$, X is not
an argument of $g_Y$, so $g_Y$ is a function of only Z and U. Since $g_Y$ is a function
of only Z and U, and not of X, $Y_{xz}(u) = Y_z(u)$; hence, $P(y \mid \hat{x}, \hat{z}) = P(y \mid \hat{z})$, and
$(X \nrightarrow Y \mid Z)_P$. □

Since $(X \nrightarrow Y \mid Z)_P \iff (X \nrightarrow Y \mid Z)_{G(M)}$ in stable causal models, probabilistic
causal irrelevance is completely characterized by the axioms of path interception
in directed graphs. A complete set of such axioms was developed in [PP94,
PPU96] and is given in Figure 4.4.


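3.2.1 (Right-Decomposition) $(X \nrightarrow YW \mid Z)_G \Longrightarrow (X \nrightarrow Y \mid Z)_G \;\&\; (X \nrightarrow W \mid Z)_G$

3.2.2 (Left-Decomposition) $(XW \nrightarrow Y \mid Z)_G \Longrightarrow (X \nrightarrow Y \mid Z)_G \;\&\; (W \nrightarrow Y \mid Z)_G$

3.4 (Strong Union) $(X \nrightarrow Y \mid Z)_G \Longrightarrow (X \nrightarrow Y \mid ZW)_G \quad \forall W$

3.5.1 (Right-Intersection) $(X \nrightarrow Y \mid ZW)_G \;\&\; (X \nrightarrow W \mid ZY)_G \Longrightarrow (X \nrightarrow YW \mid Z)_G$

3.5.2 (Left-Intersection) $(X \nrightarrow Y \mid ZW)_G \;\&\; (W \nrightarrow Y \mid ZX)_G \Longrightarrow (XW \nrightarrow Y \mid Z)_G$

3.6 (Transitivity) $(X \nrightarrow Y \mid Z)_G \Longrightarrow (a \nrightarrow Y \mid Z)_G \lor (X \nrightarrow a \mid Z)_G \quad \forall a \notin X \cup Z \cup Y$

Figure 4.4: Sound and complete axioms for path-interception in directed graphs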




4.4  Deterministic  Causal  Relevance 


The notion of causal irrelevance obtains a deterministic definition when we
consider the effects of an action conditioned on a specific state of the world u.

Definition 21 (causal irrelevance) X is causally irrelevant to Y, given Z,
written $(X \nrightarrow Y \mid Z)_T$, if

$$\forall u, z, x, x' \quad Y_{xz}(u) = Y_{x'z}(u) \qquad (4.6)$$

in every submodel of $M_z$.

This definition captures the intuition “If X is causally irrelevant to Y, then
X cannot affect Y under any circumstance.” Note that, unlike the probabilistic
definition of causal irrelevance (see Eq. (4.2)), the deterministic definition is
equivalent to

$$\forall u, z, x \quad Y_{xz}(u) = Y_z(u) \qquad (4.7)$$

Moreover, it is stronger than the probabilistic definition, in that
$(X \nrightarrow Y \mid Z)_T \Longrightarrow (X \nrightarrow Y \mid Z)_P$.

This definition of irrelevance bears some similarity to the idea of limited
unresponsiveness presented by Heckerman and Shachter [HS95]. However, whereas
they define causality in terms of limited unresponsiveness to a specific set of
actions, we view irrelevance as a property of the configuration of the mechanisms
that compose a causal model. In fact, a version of their definition of causality,
translated into our language, will be shown to be a theorem of causal irrelevance
in Section 4.4.4.2 (see Eq. (4.8)).

To see why we require the equality $Y_{xz}(u) = Y_{x'z}(u)$ to hold in every submodel
of $M_z$, consider the causal model of Figure 4.5. In this example, Z follows X
and, hence, Y follows $U_2$, that is, $Y_{x=0}(u) = Y_{x=1}(u) = u_2$. However, since
$f_Y$ is a nontrivial function of X, X is perceived to be causally relevant to Y.
Only holding Z constant reveals the causal influence of X on Y. To capture this
intuition, we must consider all submodels in Definition 21.

[Figure 4.5 specifies the following model:]
V = {X, Z, Y} binary
U = {$U_1$, $U_2$} binary
$x = u_1$
$z = x$
$y = u_2$ if $x = z$, and $y = x$ otherwise
$P(u_1) = P(u_2) = 0.5$

Figure 4.5: Example of a causal model that requires the examination of submodels
before causal relevance can be determined
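The role of submodels in Definition 21 can be seen by enumerating the model of Figure 4.5. The sketch below takes the equations X = U1, Z = X, and Y = U2 when X = Z; the value of Y in the remaining case (taken here to be X) is an assumption about the figure's definition, and the function names are illustrative.

```python
from itertools import product

def solve(u1, u2, do):
    """Solve the model under the interventions in `do` (a dict of forced values)."""
    x = do.get("X", u1)
    z = do.get("Z", x)                       # Z follows X unless held fixed
    y = do.get("Y", u2 if x == z else x)     # "otherwise" branch is an assumption
    return {"X": x, "Z": z, "Y": y}

def y_unaffected_by_x(fixed):
    """True iff, with the variables in `fixed` held constant, Y takes the same
    value for x = 0 and x = 1 in every state (u1, u2)."""
    for u1, u2 in product((0, 1), repeat=2):
        ys = {solve(u1, u2, {**fixed, "X": x})["Y"] for x in (0, 1)}
        if len(ys) > 1:
            return False
    return True

z_free = y_unaffected_by_x({})            # Z tracks X, so Y = U2 regardless of x
z_held_0 = y_unaffected_by_x({"Z": 0})    # holding Z exposes the dependence on X
z_held_1 = y_unaffected_by_x({"Z": 1})
```

With Z free to follow X, no setting of X changes Y; once Z is held constant, some state u exists in which changing X changes Y, which is why the definition quantifies over submodels.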


4.4.1  Axioms  of  Causal  Irrelevance 

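Given this definition of causal irrelevance, we have the following theorem:

Theorem 8 For any causal model, the following sentences must hold:

4.2.1 (Right-Decomposition) $(X \nrightarrow YW \mid Z)_T \Longrightarrow (X \nrightarrow Y \mid Z)_T \;\&\; (X \nrightarrow W \mid Z)_T$

4.2.2 (Left-Decomposition) $(XW \nrightarrow Y \mid Z)_T \Longrightarrow (X \nrightarrow Y \mid Z)_T \;\&\; (W \nrightarrow Y \mid Z)_T$

4.4 (Strong Union) $(X \nrightarrow Y \mid Z)_T \Longrightarrow (X \nrightarrow Y \mid ZW)_T \quad \forall W$

4.5.1 (Right-Intersection) $(X \nrightarrow Y \mid ZW)_T \;\&\; (X \nrightarrow W \mid ZY)_T \Longrightarrow (X \nrightarrow YW \mid Z)_T$

4.5.2 (Left-Intersection) $(X \nrightarrow Y \mid ZW)_T \;\&\; (W \nrightarrow Y \mid ZX)_T \Longrightarrow (XW \nrightarrow Y \mid Z)_T$

The following sentence, however, does not hold in every causal model:

4.6 (Transitivity) $(X \nrightarrow Y \mid Z)_T \Longrightarrow (a \nrightarrow Y \mid Z)_T \lor (X \nrightarrow a \mid Z)_T \quad \forall a \notin X \cup Z \cup Y$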

4.4.2  Proofs  of  Causal  Irrelevance  Axioms 

Using the theorems from Section 2.7, we can prove Theorem 8, that is, that the
axioms of causal relevance are sound.

4.2.1 Holds trivially. □

4.2.2 (By contradiction) Assume that there exists a causal model such
that $(XW \nrightarrow Y \mid Z)_T \;\&\; \neg((X \nrightarrow Y \mid Z)_T \;\&\; (W \nrightarrow Y \mid Z)_T)$. So, either
$(XW \nrightarrow Y \mid Z)_T \;\&\; \neg(X \nrightarrow Y \mid Z)_T$ or $(XW \nrightarrow Y \mid Z)_T \;\&\; \neg(W \nrightarrow Y \mid Z)_T$. First,
we consider $(XW \nrightarrow Y \mid Z)_T \;\&\; \neg(X \nrightarrow Y \mid Z)_T$. By our definition of causal
irrelevance, $\neg(X \nrightarrow Y \mid Z)_T$ implies that there exist two values x, x' of X
and some value u of U such that $Y_{xz}(u) \ne Y_{x'z}(u)$. Now, let us consider
the values x, x', z, u such that $Y_{xz}(u) \ne Y_{x'z}(u)$. Using these values, we can
determine w and w' as follows: Let $w = W_{xz}(u)$, and $w' = W_{x'z}(u)$. It does
not matter whether w = w' or w ≠ w'. By composition, $Y_{xzw}(u) \ne Y_{x'zw'}(u)$.
Thus, $\exists x, x', w, w', z, u \;\; Y_{xwz}(u) \ne Y_{x'w'z}(u)$, which contradicts $(XW \nrightarrow Y \mid Z)_T$.
Thus, $(XW \nrightarrow Y \mid Z)_T \;\&\; \neg(X \nrightarrow Y \mid Z)_T$ leads to a contradiction. We can
use a symmetric argument to show that $(XW \nrightarrow Y \mid Z)_T \;\&\; \neg(W \nrightarrow Y \mid Z)_T$
also leads to a contradiction. □

4.4 By our definition of causal irrelevance, $(X \nrightarrow Y \mid Z)_T \Longrightarrow Y_{xz}(u) = Y_{x'z}(u)$
for all submodels of $M_z$. For an arbitrary W, we consider the submodel
where W is forced to have the value w. By our definition of causal
irrelevance, $Y_{xzw}(u) = Y_{x'zw}(u)$ for all values w. In addition, since
$(X \nrightarrow Y \mid Z)_T \Longrightarrow Y_{xz}(u) = Y_{x'z}(u)$ for all submodels of $M_z$,
$Y_{xzw}(u) = Y_{x'zw}(u)$ for all submodels of $M_{zw}$. Since W was arbitrary,
$(X \nrightarrow Y \mid Z)_T \Longrightarrow (X \nrightarrow Y \mid ZW)_T$ for all W. □

4.5.1 (By contradiction) Assume
$(X \nrightarrow Y \mid ZW)_T \;\&\; (X \nrightarrow W \mid ZY)_T \;\&\; \neg(X \nrightarrow YW \mid Z)_T$. $\neg(X \nrightarrow YW \mid Z)_T$
implies $\exists x, x', z, u \;\; (Y_{xz}(u) \ne Y_{x'z}(u)) \lor (W_{xz}(u) \ne W_{x'z}(u))$. Since W and Y are
symmetric, we will only consider Y. Consider the values of x, x', z, u such
that $Y_{xz}(u) \ne Y_{x'z}(u)$. Let $y = Y_{xz}(u)$ and $y' = Y_{x'z}(u)$. By composition,
$Y_{xz}(u) = Y_{xzw}(u)$ for $w = W_{xz}(u)$. By assumption, $Y_{xzw}(u) = Y_{x'zw}(u)$.
Also by composition, $W_{xz}(u) = W_{xzy}(u)$ for $y = Y_{xz}(u)$. By assumption,
$W_{xzy}(u) = W_{x'zy}(u)$. By reversibility, since y is a solution to the
simultaneous equations $y = Y_{x'zw}(u)$ and $w = W_{x'zy}(u)$, then y must also be a solution to
$Y_{x'z}(u)$. Thus y = y', a contradiction. We can use a symmetric argument
to show that $W_{xz}(u) \ne W_{x'z}(u)$ also leads to a contradiction. □

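4.5.2 (By contradiction) Assume $(X \nrightarrow Y \mid ZW)_T \;\&\; (W \nrightarrow Y \mid ZX)_T \;\&\; \neg(XW \nrightarrow Y \mid Z)_T$.
Since $\neg(XW \nrightarrow Y \mid Z)_T$, by definition
$\exists x, x', w, w', z \;\; Y_{xwz}(u) \ne Y_{x'w'z}(u)$. However, $(X \nrightarrow Y \mid ZW)_T$
implies $\forall x, x', z, w \;\; Y_{xzw}(u) = Y_{x'zw}(u)$. Furthermore, $(W \nrightarrow Y \mid ZX)_T$
implies $\forall x', w, w', z \;\; Y_{x'wz}(u) = Y_{x'w'z}(u)$. Thus,
$\forall x, x', w, w', z \;\; Y_{xwz}(u) = Y_{x'wz}(u) = Y_{x'w'z}(u)$, and so
$\forall x, x', w, w', z \;\; Y_{xwz}(u) = Y_{x'w'z}(u)$. This
contradicts $\exists x, x', w, w', z \;\; Y_{xwz}(u) \ne Y_{x'w'z}(u)$. □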




4.4.3  Why  Transitivity  Fails  in  Causal  Relevance 


Causal  transitivity  is  a  property  that  makes  intuitive  sense.  If  a  variable  A  has  a 
causal  influence  on  B,  and  B  has  a  causal  influence  on  C,  one  would  think  that 
A  would  have  a  causal  influence  on  C.  This  is  not  always  the  case,  however,  even 
in  deterministic  causality.  Consider  the  causal  model  described  in  Figure  4.6.  In 
this  example,  X  is  causally  relevant  to  W ,  and  W  is  causally  relevant  to  Y,  but 
X  is  causally  irrelevant  to  Y.  The  intuition  behind  this  example  is  that  changing 
X  causes  only  a  minor  change  in  W,  while  Y  only  responds  to  large  changes  in 
W. 
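The intuition above can be checked on a toy model. The sketch below is a hypothetical stand-in, not the model of Figure 4.6: the mechanisms `w_of` and `y_of`, and the value ranges, are assumptions chosen only so that X nudges W by one step while Y reacts only to a jump of two steps.

```python
from itertools import product

X_VALUES = (0, 1)          # assumed range of X
U_VALUES = (0, 1)          # assumed range of the background variable U
W_VALUES = (0, 1, 2, 3)    # assumed range of W


def w_of(x, u):
    """Hypothetical mechanism for W: x shifts W by one step, u by two."""
    return x + 2 * u


def y_of(w):
    """Hypothetical mechanism for Y: responds only to a large change in W."""
    return 1 if w >= 2 else 0


def x_relevant_to_w():
    # Is there a state u and a pair x, x' for which do(X) changes W?
    return any(w_of(x, u) != w_of(xp, u)
               for u in U_VALUES for x, xp in product(X_VALUES, repeat=2))


def w_relevant_to_y():
    # Is there a pair w, w' for which do(W) changes Y?
    return any(y_of(w) != y_of(wp) for w, wp in product(W_VALUES, repeat=2))


def x_relevant_to_y():
    # Does do(X) ever change Y, in any state u, with W free to respond?
    return any(y_of(w_of(x, u)) != y_of(w_of(xp, u))
               for u in U_VALUES for x, xp in product(X_VALUES, repeat=2))


print(x_relevant_to_w(), w_relevant_to_y(), x_relevant_to_y())
# prints: True True False
```

In this toy model Y equals U in every state, so no setting of X can move it, even though each link of the chain is causally relevant on its own.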


The failure of transitivity is deeper than this, however. Even when X has
more complete control over the intermediate variable W, we still may not be able
to achieve transitivity. Consider the causal model of Figure 4.7. This model is the
same as the model of Figure 4.6 except W has now been split into $W_1, \ldots, W_4$,
corresponding to W's four possible values. That is, $W_1$ is true if $x + u_2 = 0$,
$W_2$ is true if $x + u_2 = 1$, $W_3$ is true if $x + u_2 = 2$, and $W_4$ is true if
$x + u_2 = 3$. Now, by fixing X, we can cause any of the intermediate variables
$W_1, \ldots, W_4$ to be false in any given state of the world u. Likewise, each of
the intermediate variables $W_1, \ldots, W_4$ can affect Y in any state u. However,
X still has no effect on Y in any state u.

4.4.4  Causal  Relevance  and  Directed  Graphs 
4.4.4.1 Causal Graphs as Irrelevance Maps

Comparing axioms 3.2-3.5 to axioms 4.2-4.5, we see that causal irrelevance is
quite similar to path-interception in directed graphs. Since people (and machines)
can easily reason about graphs, a graph that represents all of the causal relevances
and irrelevances of a given causal model would be useful. That is, we would like
to create a graph $G^*(M)$ such that

(i) Each variable X in M corresponds to exactly one node $X^*$ in $G^*(M)$,

(ii) For all subsets of nodes $X^*, Y^*, Z^*$ in $G^*(M)$,
$(X^* \not\rightsquigarrow Y^* \mid Z^*)_{G^*(M)} \Rightarrow (X \not\to Y \mid Z)_T$, and

(iii) For all subsets of variables X, Y, Z in M,
$(X \not\to Y \mid Z)_T \Rightarrow (X^* \not\rightsquigarrow Y^* \mid Z^*)_{G^*(M)}$.

In  graph  G*{M),  if  all  directed  paths  from  X*  to  Y*  are  intercepted  by  some 
variables  in  Z,  then  X  is  causally  irrelevant  to  Y  in  the  model  M.  Likewise,  if  a 
set  of  variables  X  is  causally  irrelevant  to  a  set  Y  given  fixed  Z,  then  all  paths 
from  nodes  in  X*  to  nodes  in  Y*  are  intercepted  by  some  variables  in  Z. 
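The path-interception test used in conditions (ii) and (iii) is easy to mechanize. The following is a minimal sketch, assuming the graph is given as a list of directed edges and that X, Y, and Z are disjoint sets of node names; the function name `intercepts` is illustrative and does not come from the text.

```python
def intercepts(edges, xs, ys, zs):
    """Return True if every directed path from a node in xs to a node in ys
    passes through some node in zs (xs, ys, zs assumed disjoint)."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)

    # Depth-first search from xs that never steps onto a node in zs.
    stack = [x for x in xs if x not in zs]
    seen = set(stack)
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child in zs or child in seen:
                continue
            if child in ys:
                return False   # found a path that avoids zs
            seen.add(child)
            stack.append(child)
    return True


chain = [("X", "W"), ("W", "Y")]
print(intercepts(chain, {"X"}, {"Y"}, {"W"}))                 # prints: True
print(intercepts(chain, {"X"}, {"Y"}, set()))                 # prints: False
print(intercepts(chain + [("X", "Y")], {"X"}, {"Y"}, {"W"}))  # prints: False
```

The search handles cyclic graphs as well, since visited nodes are never expanded twice.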

The obvious choice for $G^*(M)$ is $G(M)$, the graph associated with the causal
model itself, as defined by Eq. (2.1). If we use $G^*(M) = G(M)$, then implication
(ii) holds, since in Section 4.3.3 we showed that
$(X \not\rightsquigarrow Y \mid Z)_{G(M)} \Rightarrow Y_{xz}(u) = Y_{x'z}(u)$, and thus
$(X \not\to Y \mid Z)_T$. However, since transitivity always holds in path
interception but does not always hold in causal irrelevance, for a given model M
there might be no graph $G^*(M)$ such that implications (ii) and (iii) hold
simultaneously. Nonetheless, we can use directed graphs to validate candidate
theorems of causal irrelevance, as we show next.

4.4.4.2 Directed Graphs as Theorem Provers

Consider an oracle that takes in statements about path-interception and returns
YES if the statement holds in all directed graphs and NO otherwise. We will
show that such an oracle can be used to validate or refute sentences about causal
relevance.

First, let us consider a language of causal relevance in which the literals stand
for simple irrelevance statements of the form $(X \not\to Y \mid Z)_T$, where X, Y
and Z are sets of variables. Second, let the canonical form for sentences in the
language of causal irrelevance be an implication
$a_1 \,\&\, a_2 \,\&\, \ldots \,\&\, a_i \Rightarrow b_1 \vee b_2 \vee \ldots \vee b_k$
whose antecedent consists of a conjunction of non-negated literals and whose
consequent consists of a disjunction of non-negated literals. For instance,
consider the sentence*

$(X \not\to Y \mid Z)_T \,\&\, \neg(X \not\to Y \mid \emptyset)_T \Rightarrow \neg(Z \not\to Y \mid \emptyset)_T$  (4.8)

This sentence is not in canonical form because the second conjunct in the
antecedent is negated and the statement in the consequent is negated. The canonical
form of this sentence is

$(X \not\to Y \mid Z)_T \,\&\, (Z \not\to Y \mid \emptyset)_T \Rightarrow (X \not\to Y \mid \emptyset)_T$  (4.9)

* A version of this sentence was chosen in [HS95] as the definition of causality.




Any  causal  irrelevance  sentence  can  be  written  in  a  unique  canonical  form 
using  standard  logical  procedures. 

Definition 22 (Horn component) A Horn component H of a causal irrelevance
sentence S is a sentence H such that

(i) H is in canonical form,

(ii) The consequent of H contains no disjunctions, and

(iii) $H \Rightarrow S$.

If a sentence S is in the canonical form
$a_1 \,\&\, a_2 \,\&\, \ldots \,\&\, a_i \Rightarrow b_1 \vee b_2 \vee \ldots \vee b_k$,
then a Horn component of S is any sentence of the form
$a_1 \,\&\, a_2 \,\&\, \ldots \,\&\, a_i \Rightarrow b_j$.

For  example,  Eq.  (4.9)  has  no  disjunctions  in  its  consequent  and,  hence,  is  itself 
a  Horn  component. 

For any causal irrelevance statement A of the form $(X \not\to Y \mid Z)_T$, we will
consider $A_G$, the graphical translation of A, to be the corresponding
path-interception statement $(X \not\rightsquigarrow Y \mid Z)_{G(M)}$. Using this
convention, we can state

Theorem 9 (graphical theorem verification) A causal irrelevance sentence S is
true for all causal models iff there exists a Horn component H of S such that $H_G$,
the graphical translation of H, is true for all graphs.

For example, consider the sentence in Eq. (4.8). The canonical form of
this sentence is given in Eq. (4.9) and is itself a Horn component. The sentence
corresponding to Eq. (4.9) for path-interception in directed graphs,
$(X \not\rightsquigarrow Y \mid Z)_G \,\&\, (Z \not\rightsquigarrow Y \mid \emptyset)_G \Rightarrow (X \not\rightsquigarrow Y \mid \emptyset)_G$,
states that if all paths from X to Y are intercepted by Z, and there are no paths
from Z to Y, then there is no path from X to Y. This sentence is true for all
directed graphs, so Eq. (4.8) is a valid theorem of causal relevance.

Next, consider transitivity, stated as
$(X \not\to Y \mid Z)_T \Rightarrow (a \not\to Y \mid Z)_T \vee (X \not\to a \mid Z)_T$.
The Horn components of this sentence are

$H_1$: $(X \not\to Y \mid Z)_T \Rightarrow (a \not\to Y \mid Z)_T$  (4.10)

$H_2$: $(X \not\to Y \mid Z)_T \Rightarrow (X \not\to a \mid Z)_T$.  (4.11)

Looking at each of the corresponding path-interception sentences in turn, we find
that neither $H_{1G}$: $(X \not\rightsquigarrow Y \mid Z)_G \Rightarrow (a \not\rightsquigarrow Y \mid Z)_G$
nor $H_{2G}$: $(X \not\rightsquigarrow Y \mid Z)_G \Rightarrow (X \not\rightsquigarrow a \mid Z)_G$
is true for all directed graphs G. That is, if Z intercepts all paths from X to Y,
it is not the case that either Z intercepts all paths from any other variable to Y
or Z intercepts all paths from X to any other variable. Thus, transitivity is not
a theorem of causal relevance.
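The two graphical checks above can be imitated by brute force on small graphs. The sketch below is an illustration under stated assumptions, not the oracle of the text: it enumerates every directed graph on a handful of named nodes, so it can refute a path-interception sentence by exhibiting a counterexample, but finding none is only evidence (not proof) that the sentence holds in all graphs. The names `intercepted`, `all_digraphs`, and `find_counterexample` are hypothetical, and each statement is restricted to single nodes x and y with a set zs.

```python
from itertools import product


def intercepted(edges, x, y, zs):
    """True if every directed path from x to y passes through a node in zs."""
    stack, seen = [x], {x}
    while stack:
        node = stack.pop()
        for a, b in edges:
            if a != node or b in zs or b in seen:
                continue
            if b == y:
                return False
            seen.add(b)
            stack.append(b)
    return True


def all_digraphs(nodes):
    """Yield every directed graph on nodes as a set of edges (no self-loops)."""
    possible = [(a, b) for a in nodes for b in nodes if a != b]
    for mask in product((False, True), repeat=len(possible)):
        yield {e for e, keep in zip(possible, mask) if keep}


def find_counterexample(nodes, antecedents, consequent):
    """Return a graph in which all antecedents hold and the consequent fails,
    or None if no such graph exists on the given nodes."""
    for g in all_digraphs(nodes):
        if all(intercepted(g, *s) for s in antecedents) and \
                not intercepted(g, *consequent):
            return g
    return None


# Graphical translation of Eq. (4.9): no counterexample on three nodes.
eq_4_9 = find_counterexample(
    ("X", "Y", "Z"),
    [("X", "Y", {"Z"}), ("Z", "Y", set())],
    ("X", "Y", set()))

# Graphical translations of Eqs. (4.10) and (4.11): both are refuted.
h1 = find_counterexample(("X", "Y", "Z", "a"), [("X", "Y", {"Z"})], ("a", "Y", {"Z"}))
h2 = find_counterexample(("X", "Y", "Z", "a"), [("X", "Y", {"Z"})], ("X", "a", {"Z"}))

print(eq_4_9 is None, h1 is not None, h2 is not None)  # prints: True True True
```

A graph containing only the arrow from a to Y refutes the first Horn component, and a graph containing only the arrow from X to a refutes the second, matching the informal argument in the text.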

Proof  (of  Theorem  9): 

First,  we  prove  that  if  there  are  no  disjunctions  in  the  consequent  of  a  canonical 
form  sentence,  then  the  sentence  is  true  iff  the  corresponding  sentence  is  true  for 
path-interception  in  directed  graphs. 

We will prove this by contradiction. Assume that there exists some sentence
$A \Rightarrow B$, where A and B are conjunctions of literals, such that

(i) $A \Rightarrow B$ is not a theorem in causal irrelevance, and

(ii) $A_G \Rightarrow B_G$ is a theorem in path-interception in directed graphs.

Since $A_G \Rightarrow B_G$ is a theorem in path-interception, we must be able to
generate $B_G$ from $A_G$ using the axioms of path-interception in directed graphs.
However, since $A \Rightarrow B$ is not a theorem in causal irrelevance, every such
generation of $B_G$ from $A_G$ must include an application of the axiom of
transitivity. When the axiom of transitivity is used, a disjunction is created.
This disjunction must be used in the generation of $B_G$. By assumption, $B_G$ does
not contain a disjunction. Also, none of the antecedents of any of the axioms of
path-interception contain disjunctions. Thus, the only way to use this disjunction
in the generation of $B_G$ is to resolve the disjunction with a negated clause.
Since $A_G$ started with no negated statements, and none of the axioms of
path-interception can be used to create negated statements, we cannot resolve the
disjunction with anything. Thus, generating $B_G$ from $A_G$ did not require an
application of transitivity, a contradiction.

Next, we prove that if $A \Rightarrow B \vee C$ is a theorem in causal irrelevance,
then either $A \Rightarrow B$ is a theorem in causal irrelevance or $A \Rightarrow C$
is a theorem in causal irrelevance. If $A \Rightarrow B \vee C$ is a theorem in
causal irrelevance, then we must be able to generate $B \vee C$ from A using the
axioms of causal irrelevance. Since no axiom creates a disjunction, to generate
$B \vee C$ from A we must either generate B from A and add C or generate C from A
and add B. Thus, a causal irrelevance sentence is a theorem iff there is a
path-interception theorem that corresponds to one of the Horn components of the
original sentence. □

4.5  Applications  of  Deterministic  Causal  Relevance 

Frequently, researchers would like to know the causal effect of some variable,
that is, $P(y \mid \hat{x})$. This question comes up especially in the medical
sciences, where researchers want to determine the effectiveness of a particular
drug or treatment. This quantity, $P(y \mid \hat{x})$, is often difficult, if not
impossible, to measure. However, the conditional probability $P(y \mid x)$ is often
relatively easy to measure. We would like to be able to relate the measurable
quantity, $P(y \mid x)$, to the desired quantity, $P(y \mid \hat{x})$.

We can use the deterministic definition of causal irrelevance to show when
observation yields the same probability as action, that is, when
$P(y \mid \hat{x}) = P(y \mid x)$. The following theorem gives the conditions under
which observation of X yields the same probability distribution on Y as
intervention on X.

Theorem 10 For any two variables X, Y, and for any two values x, y,
$P(y \mid \hat{x}) = P(y \mid x)$ if
$\forall A \in U \cup V,\ (A \not\to X \mid \emptyset)_T \vee (A \not\to Y \mid X)_T$.

Theorem 10 is a precise statement of what is called ignorability in the
statistical literature, and it justifies the use of randomized experiments to
measure the quantity $P(y \mid \hat{x})$. For example, when trying to determine the
causal effect of the dosage of a drug (X) on the recovery of a patient (Y),
researchers will often utilize a randomized experiment. The treatment assignment
(Z) is randomized. Assuming perfect compliance, the dosage X each patient takes is
determined completely by the treatment assignment Z, and the recovery Y is
measured. Since Z is randomized, $\forall A,\ (A \not\to Z \mid \emptyset)_T$. If
there is full compliance, then for any variable W other than Z,
$(W \not\to X \mid \emptyset)_T$. If the experiment is double-blind, placebos are
used to prevent the doctors and the patients from knowing the value of either the
treatment assignment Z or the dosage X, and, thus, the treatment assignment Z
cannot affect the recovery Y except through the action of the drug. That is,
$(Z \not\to Y \mid X)_T$. So, for any variable $A \in U \cup V$, if $A = Z$ then
$(A \not\to Y \mid X)_T$, and $(A \not\to X \mid \emptyset)_T$ otherwise. Thus
$\forall A \in U \cup V,\ (A \not\to X \mid \emptyset)_T \vee (A \not\to Y \mid X)_T$,
and $P(y \mid \hat{x}) = P(y \mid x)$.
Proof of Theorem 10:

Given $\forall A \in U \cup V,\ (A \not\to X \mid \emptyset)_T \vee (A \not\to Y \mid X)_T$,
we show that $P(y \mid x) = P(y \mid \hat{x})$.

We separate the exogenous variables U into two groups, U' and U'', such that
$(U' \not\to X \mid \emptyset)_T$ and $(U'' \not\to Y \mid X)_T$. We first consider
the set of possible values u'' of U''. We are going to create a set of values u''
of U'', which we call $\alpha$, that is the set of all values u'' of U'' for which
there exists a value u' of U' such that $X(u', u'') = x$. Now we consider each
element a of $\alpha$, and for each element a we create a set $\beta_a$. Set
$\beta_a$ is the set of all values u' of U' such that $Y(u', a) = y$. We can now
express $P(y \mid x)$ in terms of $\alpha$ and $\beta_a$:

$$P(y \mid x) = \frac{\sum_{a \in \alpha} P(a) \sum_{b \in \beta_a} P(b)}{\sum_{a \in \alpha} P(a)}$$

By the causal irrelevances that we used to separate U into U' and U'', and by
composition, we know that for any two elements $a_1, a_2 \in \alpha$,
$\beta_{a_1} = \beta_{a_2}$. We can thus define a new set $\gamma$ such that
$\forall a \in \alpha,\ \gamma = \beta_a$. Hence, the above formula can be
simplified to

$$P(y \mid x) = \sum_{g \in \gamma} P(g)$$

We now consider the value of $P(y \mid \hat{x})$. By the irrelevancies that separate
U into U' and U'', we know that, in the submodel with X fixed at x, Y is a function
only of members of U', and not of members of U'', and

$$P(y \mid \hat{x}) = \sum_{u' \in U' \,\mid\, Y_x(u') = y} P(u')$$

We now consider the set $\{u' \in U' \mid Y_x(u') = y\}$. This set is the set of all
values u' of U' such that when X has the value x, Y will have the value y. This set
is identical to $\gamma$, so

$$P(y \mid \hat{x}) = \sum_{g \in \gamma} P(g) = P(y \mid x)$$

and $P(y \mid \hat{x}) = P(y \mid x)$. □




In  other  words,  causal  relevance  gives  a  theoretical  justification  for  the  use  of 
randomized  experiments  for  measuring  causal  effects. 
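The contrast between observing and intervening can be made concrete by exact enumeration over the exogenous states. The sketch below is a hypothetical illustration, not a model taken from the text: the two structural models, the disturbance probabilities in `P_U`, and the helper names are all assumptions. In the first model X is set by a coin alone, as in a randomized trial with full compliance; in the second a background factor drives both X and Y.

```python
from fractions import Fraction
from itertools import product


def prob(u, p_u):
    """Probability of a joint state u of independent binary disturbances."""
    result = Fraction(1)
    for value, p_one in zip(u, p_u):
        result *= p_one if value == 1 else 1 - p_one
    return result


def observe(model, p_u, x, y):
    """P(Y=y | X=x): condition on seeing X=x in the unmanipulated model."""
    num = den = Fraction(0)
    for u in product((0, 1), repeat=len(p_u)):
        x_val = model["X"](u)
        if x_val == x:
            den += prob(u, p_u)
            if model["Y"](x_val, u) == y:
                num += prob(u, p_u)
    return num / den


def intervene(model, p_u, x, y):
    """P(Y=y | do(X=x)): the equation for X is replaced by the constant x."""
    return sum((prob(u, p_u) for u in product((0, 1), repeat=len(p_u))
                if model["Y"](x, u) == y), Fraction(0))


P_U = (Fraction(1, 2), Fraction(3, 10))  # P(U1=1), P(U2=1): illustrative values

# Randomized assignment: X depends on the coin U1 only; U2 affects Y only.
randomized = {"X": lambda u: u[0], "Y": lambda x, u: x | u[1]}
# Confounded study: the background factor U2 drives both X and Y.
confounded = {"X": lambda u: u[1], "Y": lambda x, u: x | u[1]}

print(observe(randomized, P_U, 0, 1), intervene(randomized, P_U, 0, 1))  # prints: 3/10 3/10
print(observe(confounded, P_U, 0, 1), intervene(confounded, P_U, 0, 1))  # prints: 0 3/10
```

In the randomized model every disturbance is irrelevant either to X or to Y given X, and the two quantities coincide; in the confounded model U2 is relevant to both, and they differ.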

4.6  Conclusion 

How do scientists predict the outcome of one experiment from the results of
other experiments run under totally different conditions? Such transfer of
experimental knowledge, although essential to scientific progress, involves
inferences that cannot easily be formalized in the standard languages of logic,
physics, or probability.

The formalization of such inferences requires a language within which the
experimental conditions prevailing in an experiment can be represented and then
the outcome of that experiment can be posed as a constraint in the design and
analysis of the next experiment. Description of experimental conditions, in turn,
involves both observational and manipulative sentences, and it requires that
manipulative phrases (e.g., “having no effect on,” “holding Z fixed”), as distinct
from observational phrases (e.g., “being independent of,” “conditioning on Z”),*
be given formal notation, semantic interpretation, and axiomatic characterization.
It turns out that standard algebras, including the algebra of equations,
Boolean algebra, and probability calculus, are all geared to serve observational
but not manipulative sentences.

* Philosophers, statisticians, and economists have been notoriously sloppy about
confusing “holding Z constant” with “conditioning on a given Z” [Pea95a].

This chapter bases the semantics of manipulative sentences on a set of
structural equations that we call a causal model. Unlike ordinary algebraic
equations, a causal model treats every equation as an independent mathematical
object attached to one and only one variable. Actions are treated as modalities
and are interpreted as the nonalgebraic operator of replacing equations.

This semantics permits us to develop an axiomatic characterization of
manipulative statements of the form “Changing X will not affect Y if we hold Z
constant,” which we propose as the meaning of causal irrelevance: “X is causally
irrelevant to Y in context Z.” This axiomatization highlights the differences
between causal irrelevance and informational irrelevance, as in “Finding X will
not affect our belief in Y, once we know Z.” The former shows a closer affinity
to graphical representation than the latter. Under the deterministic definition,
causal irrelevance complies with all of the axioms of path interception in cyclic
graphs except transitivity. This affinity leads to graphical methods of proving
theorems about causal relevance and explains, in part, why graphs are so
prevalent in causal talk and causal modeling.

Outside of artificial intelligence, our results have interesting ramifications
in the fields of statistics and epidemiology where, thus far, the only
accepted formalization of causation has been Rubin’s framework of counterfactuals
[Rub74, Rob86b], which is a rather cumbersome language for expressing causal
knowledge. Graphical and structural equation models, popular as they are in
econometrics and the social sciences, are viewed with suspicion by statisticians
because the causal interpretation of these models has not been adequately
formalized [Fre87, Wer92].

Our translation of counterfactuals into statements about structural equation
models (Definition 6) generalizes and unifies the structural and counterfactual
approaches, and clarifies their conceptual and mathematical bases. The soundness
of effectiveness and composition, the only properties of counterfactuals used
by Rubin, ensures that every theorem in Rubin’s framework is also a theorem in
structural equation models. The completeness of effectiveness and composition in
recursive models guarantees that the structural interpretation of counterfactuals
introduces no extraneous properties beyond those embodied in Rubin’s framework.
Most significant, this unification permits investigators to express causal
knowledge in the intuitively appealing language of causal graphs, use the graphs
as inferential machinery, and be assured of the validity of the results.


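[Figure 4.7 model specification, as recovered from the figure]

$V = \{X, W_1, W_2, W_3, W_4, Y\}$ binary
$U = \{U_1, U_2\}$ binary
$X = U_1$
$W_1 = \neg X \,\&\, \neg U_2$
$W_2 = X \,\&\, \neg U_2$
$W_3 = \neg X \,\&\, U_2$
$W_4 = X \,\&\, U_2$
$Y = (w_3 \,\&\, \neg w_2) \vee (w_4 \,\&\, \neg w_1 \,\&\, \neg w_2)$
$P(u_1) = P(u_2) = 0.5$

Figure 4.7: Transitivity fails, even when a variable is more completely controlled
by its parents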


CHAPTER  5 


Identifying  Causal  Effects 

5.1  Introduction 

This  chapter  addresses  one  of  the  applications  of  causal  models:  determining  the 
causal  effect  of  a  variable  or  set  of  variables  on  another  variable  or  set  of  variables. 
As  an  introduction,  let  us  consider  a  possible  problem  that  we  would  like  to 
solve.  Assume  we  need  to  replace  an  expert  operating  a  complex  production 
plant.  Before  we  take  charge,  we  are  given  a  blueprint  of  the  plant  together  with 
an  explanation  of  the  functions  of  the  various  dials  and  knobs,  and  we  are  able 
to  observe  the  expert  in  action  over  a  long  period  of  time.  During  this  period,  we 
record  which  dials  the  expert  consults  prior  to  taking  actions.  Moreover,  although 
we  understand  the  function  of  those  dials,  we  cannot  always  observe  the  actual 
reading on each of them. The data we are able to collect during the observation
period  include  the  actions  taken  by  the  agent,  the  readings  of  some  of  the  dials, 
and  the  outcome  of  various  performance  indicators.  Our  problem  is  to  predict, 
on  the  basis  of  the  data  collected,  the  effect  of  a  given  action  on  the  performance 
of  the  plant. 

The problem of learning from the actions of other agents is that one is never
sure whether an observed response is due to the agent’s action or due to events
that triggered that action and simultaneously caused the response. Such events
are called confounders, and they present a major problem in the analysis of
observational studies in the social and health sciences. For example, we cannot
be sure whether it was the prescribed drug, or some prior condition that the
doctor tried to treat by prescribing the drug, that caused the patient to vomit.
Similarly, we cannot tell whether a recession was caused by higher taxes or by the
economic indicators which government experts consulted before raising taxes.

The standard technique for dealing with confounders is to adjust for possible
variations in those environmental factors which might trigger the actions.
This amounts to conditioning the observed distribution on the various levels of
those factors, evaluating the action in each level separately, and then taking the
(weighted) average over those levels. However, in problems like those described
above, some of the confounding factors are unobservable; hence, they cannot be
conditioned on.
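The adjustment recipe in the preceding paragraph can be written out in a few lines. The sketch below assumes a single observed confounder Z with a strictly positive joint distribution over (z, x, y); the function names and all numbers are illustrative assumptions, and the formula used, the sum over z of P(y | x, z) P(z), is the standard weighted average over levels of Z.

```python
from fractions import Fraction


def adjust(joint, x, y):
    """Adjusted estimate: sum over z of P(y | x, z) * P(z).

    joint maps (z, x, y) triples to probabilities; P(x, z) is assumed positive.
    """
    total = Fraction(0)
    for z in {z for z, _, _ in joint}:
        p_z = sum(p for (zz, _, _), p in joint.items() if zz == z)
        p_xz = sum(p for (zz, xx, _), p in joint.items() if (zz, xx) == (z, x))
        total += joint.get((z, x, y), Fraction(0)) / p_xz * p_z
    return total


def naive(joint, x, y):
    """Unadjusted conditional probability P(y | x)."""
    p_x = sum(p for (_, xx, _), p in joint.items() if xx == x)
    p_xy = sum(p for (_, xx, yy), p in joint.items() if (xx, yy) == (x, y))
    return p_xy / p_x


# Hypothetical observational distribution built from P(z) P(x | z) P(y | x, z).
p_z1 = Fraction(1, 2)
p_x1_given_z = {0: Fraction(1, 5), 1: Fraction(4, 5)}
p_y1_given_xz = {(0, 0): Fraction(1, 10), (1, 0): Fraction(3, 10),
                 (0, 1): Fraction(5, 10), (1, 1): Fraction(7, 10)}  # keyed (x, z)

joint = {}
for z in (0, 1):
    for x in (0, 1):
        for y in (0, 1):
            pz = p_z1 if z == 1 else 1 - p_z1
            px = p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]
            py = p_y1_given_xz[(x, z)] if y == 1 else 1 - p_y1_given_xz[(x, z)]
            joint[(z, x, y)] = pz * px * py

print(adjust(joint, 1, 1), naive(joint, 1, 1))  # prints: 1/2 31/50
```

The gap between the two numbers is the bias introduced by ignoring Z. When Z is unobserved, `adjust` cannot be evaluated at all, which is the difficulty the paragraph above points out.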

The techniques developed in this chapter will enable us to recognize, by
graphical means, whether a given action can be evaluated from joint distributions
on observed quantities and, if so, to decide which quantities should be measured
and how to adjust for them. Technically speaking, the task parallels the
identification of recursive structural equations in the presence of unmeasured
variables. However, whereas traditional theories of identifiability deal
exclusively with estimating linear coefficients in parametric equations, the
identifiability problem solved in this chapter is nonparametric; no assumptions
are made regarding either the functional forms of the structural equations or the
distributions of the errors.¹

¹ Naturally, nonparametric identifiability is not concerned with values of
numerical parameters but with the ultimate purpose to which parameters are being
put in structural models, namely, the analysis of actions and causal effects.




5.2  Identifiability  in  Econometrics 


Determining the causal effect of one variable on another variable has been
extensively explored in the econometric literature, where it is called the
identification problem [Fis66, KR51]. The economic structural equation model
consists of M equations in N variables, with M random disturbances. These
equations are assumed to be linear in both the variables and the disturbances.
Thus, the model can be summed up by the matrix equation

$A \cdot X = U$  (5.1)

where X is a 1 × N matrix of the observed variables, A is an M × N matrix of
linear coefficients, and U is a 1 × M matrix of random disturbances (which are
all zero when considering nonstochastic cases).

The object of the identification problem is to determine the value of the
matrix A. In some instances, all of the values in matrix A can be determined; in
other cases, only certain coefficients can be determined. In the econometric
literature, when a coefficient can be determined by the data, it is said to be
identifiable. In the identification problem, we want to find not only a matrix
that is observationally equivalent but the actual matrix that determines the
interactions of the variables X.

The value of identifying a system of equations is obvious. As soon as we know
the equations that model a system, we know everything about how the system
behaves. We can determine the value of any set of variables under all possible
interventions on other variables. We can consider how changes to the model
will affect various variables. Thus, we can use such a system to inform policy
decisions. Even if we can only identify some of the coefficients, in many cases we
can still determine the direct or total effect of one variable on another.

There  are,  however,  many  limitations  to  this  approach.  One  of  the  greatest 
is  the  linearity  assumption.  We  often  cannot  assume  that  all  of  the  interactions 
are  linear,  yet  if  we  allow  each  variable  to  be  an  arbitrary  function  of  the  other 
variables  in  the  system,  then  there  is  no  way,  in  general,  to  completely  determine 
the  values  of  the  functions,  as  we  can  in  the  linear  case.  All  is  not  lost,  however. 
Let  us  consider  why  we  wanted  to  find  the  values  of  the  functions  in  the  first 
place.  We  need  the  actual  equations,  instead  of  observationally  equivalent  ones, 
only  when  we  want  to  predict  how  the  system  would  react  to  interventions  that 
are  outside  the  measured  values  of  the  variables.  We  will  show  how  we  can 
obtain  information  about  how  the  system  responds  to  interventions,  even  without 
completely  specifying  the  equations  in  the  model. 

5.3  Identification  in  Causal  Models 

In the language of causal models, the problem addressed in this chapter is the
evaluation of the effects of a concurrent action do(X = x), where X is some
subset of variables from V, on a subset of variables Y. We will be examining
such actions for the case where the causal model is not fully specified. We are
given the topology of the causal model but not the actual functions that relate the
variables to each other. Numerical probabilities are given for only the variables
which are deemed “observable,” while the variables deemed “unobservable” serve
only to specify possible connections among observed quantities, and are not given
numerical probabilities.

Pearl [Pea94] has reviewed the use of causal models in this fashion and
proposed a calculus for deriving probabilistic assessments of the effects of
actions in the presence of unmeasured variables. This calculus can be used to
check or search for a proof that the effect of one variable on another is
identifiable, namely, that it is possible, from data involving only observed
variables, to obtain a consistent estimate of the probability of Y under the
condition that X is set to x by external intervention. This chapter systematizes
the search for such a proof by providing a polynomial-time graph-based method for
determining whether the effect of one variable on another is identifiable.²

² An extension of our analysis to the case of multiple actions (sequential or
concurrent) is reported in [PR95].

If identifiability is confirmed, the method generates closed-form expressions
for the distribution of the outcome variable Y under the external manipulation
of the control variable X. The derived expression, denoted $P(y \mid do(x))$,
invokes only measured probabilities as obtained, for example, by recording the
past performance of other acting agents. Although the actions of those agents may
have been triggered by factors unseen by the analyst, the impact of X on Y can
still be predicted using observed variables only. If Y stands for a goal variable,
then the probability of reaching the goal through each action do(X = x) can be
determined from such partial observations.

5.4 Notation and Definitions

We now provide some notation and definitions that will be required in the rest
of the chapter.

Definition 23 (identifiability) The causal effect of X on Y is said to be
identifiable if the quantity $P(y \mid \hat{x})$ can be computed uniquely from any
positive distribution of the observed variables, that is, if for every pair of
models $M_1$ and $M_2$ such that $P_{M_1}(v) = P_{M_2}(v) > 0$, we have
$P_{M_1}(y \mid \hat{x}) = P_{M_2}(y \mid \hat{x})$.

Identifiability means that $P(y \mid \hat{x})$ can be estimated consistently from an
arbitrarily large sample randomly drawn from the distribution of the observed
variables.
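To see what the definition rules out, it helps to build two models that agree on everything observable yet disagree on the effect of an action. The sketch below is a hypothetical construction, not an example from the text: the observed distribution `P_XY`, both structural models, and all helper names are assumptions. Both models are compatible with a graph in which X may affect Y and a hidden variable may affect both.

```python
from fractions import Fraction

# A positive observed distribution P(x, y); the numbers are illustrative.
P_XY = {(0, 0): Fraction(4, 10), (0, 1): Fraction(1, 10),
        (1, 0): Fraction(2, 10), (1, 1): Fraction(3, 10)}


def p_x(x):
    return sum(p for (xx, _), p in P_XY.items() if xx == x)


def cond(y, x):
    """P(y | x) in the observed distribution."""
    return P_XY[(x, y)] / p_x(x)


def observed(states, f_x, f_y):
    """Joint P(x, y) induced by a model with no intervention."""
    dist = {}
    for p, u in states:
        x = f_x(u)
        key = (x, f_y(x, u))
        dist[key] = dist.get(key, Fraction(0)) + p
    return dist


def effect(states, f_y, x, y):
    """P(Y=y | do(X=x)): the equation for X is replaced by the constant x."""
    return sum((p for p, u in states if f_y(x, u) == y), Fraction(0))


# Model 1: X causes Y and the disturbances are independent.
# u = (x, y0, y1), where y0 and y1 are Y's responses to X=0 and X=1.
states_1 = [(p_x(x) * cond(y0, 0) * cond(y1, 1), (x, y0, y1))
            for x in (0, 1) for y0 in (0, 1) for y1 in (0, 1)]
f_x1 = lambda u: u[0]
f_y1 = lambda x, u: u[1 + x]

# Model 2: a hidden common cause u = (x, y) sets both variables; Y ignores X.
states_2 = [(p, u) for u, p in P_XY.items()]
f_x2 = lambda u: u[0]
f_y2 = lambda x, u: u[1]

print(observed(states_1, f_x1, f_y1) == observed(states_2, f_x2, f_y2))  # prints: True
print(effect(states_1, f_y1, 1, 1), effect(states_2, f_y2, 1, 1))        # prints: 3/5 2/5
```

Because the two models induce the same positive distribution over (X, Y) but different values of P(y | do(x)), the effect of X on Y is not identifiable from observations of X and Y alone in this setting.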

Definition 24 (back-door path) A path from X to Y in a graph G is said to be
a back-door path if it contains an arrow into X.


The probabilistic analysis of causal models becomes particularly simple when
two conditions are satisfied:

1. The model is recursive, that is, there exists an ordering of the variables
$V = \{X_1, \ldots, X_n\}$ such that each $X_i$ is a function of a subset $pa_i$
of its predecessors, denoted

$X_i = f_i(pa_i, U_i), \quad pa_i \subseteq \{X_1, \ldots, X_{i-1}\}$  (5.2)

2. The disturbances $U_1, \ldots, U_n$ are mutually independent,
$U_i \perp\!\!\!\perp U_j$, which also implies (from the exogeneity of the
$U_i$'s) that

$U_i \perp\!\!\!\perp \{X_1, \ldots, X_{i-1}\}$  (5.3)

These two conditions, also called the Markovian assumptions, are the basis of
Bayesian networks [Pea88], and they enable us to compute causal effects directly
from the conditional probabilities $P(x_i \mid pa_i)$ without having to specify
either the functional form of the functions $f_i$ or the distributions $P(u_i)$ of
the disturbances. This is seen immediately from the following observations.




The distribution induced by any Markovian model M is given by the product

$$P_M(x_1, \ldots, x_n) = \prod_i P(x_i \mid pa_i) \qquad (5.4)$$

where $pa_i$ are the direct predecessors (called parents) of $X_i$ in the diagram.
The distribution induced by the submodel that represents the action
$do(X_j = x'_j)$ is also Markovian and, hence, also induces a product-like
distribution:

$$P_{x'_j}(x_1, \ldots, x_n) =
\begin{cases}
\prod_{i \neq j} P(x_i \mid pa_i) = \dfrac{P(x_1, \ldots, x_n)}{P(x_j \mid pa_j)} & \text{if } x_j = x'_j \\
0 & \text{if } x_j \neq x'_j
\end{cases} \qquad (5.5)$$

where the partial product reflects the surgical removal of

$$X_j = f_j(pa_j, U_j)$$

from the model of Eq. (5.2).
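Equations (5.4) and (5.5) translate directly into code. The sketch below assumes a small hypothetical Markovian network with Z a parent of X and both Z and X parents of Y; the conditional probability tables and the names `PARENTS`, `CPTS`, `p_joint`, and `p_after_do` are illustrative assumptions, not part of the text.

```python
from fractions import Fraction as F

ORDER = ("Z", "X", "Y")
PARENTS = {"Z": (), "X": ("Z",), "Y": ("Z", "X")}
# CPTS[var][parent values][value] = P(value | parent values); numbers are made up.
CPTS = {
    "Z": {(): {0: F(1, 2), 1: F(1, 2)}},
    "X": {(0,): {0: F(4, 5), 1: F(1, 5)}, (1,): {0: F(1, 5), 1: F(4, 5)}},
    "Y": {(0, 0): {0: F(9, 10), 1: F(1, 10)}, (0, 1): {0: F(7, 10), 1: F(3, 10)},
          (1, 0): {0: F(5, 10), 1: F(5, 10)}, (1, 1): {0: F(3, 10), 1: F(7, 10)}},
}


def factor_product(assignment, skip=None):
    """Product of P(x_i | pa_i) over all variables except `skip`."""
    p = F(1)
    for var in ORDER:
        if var == skip:
            continue
        pa = tuple(assignment[q] for q in PARENTS[var])
        p *= CPTS[var][pa][assignment[var]]
    return p


def p_joint(assignment):
    """Eq. (5.4): the full product."""
    return factor_product(assignment)


def p_after_do(assignment, target, value):
    """Eq. (5.5): drop the factor of the manipulated variable;
    assignments inconsistent with the action get probability zero."""
    if assignment[target] != value:
        return F(0)
    return factor_product(assignment, skip=target)


# P(Y=1 | do(X=1)), obtained by summing the post-action distribution over Z.
effect = sum(p_after_do({"Z": z, "X": 1, "Y": 1}, "X", 1) for z in (0, 1))
print(effect)  # prints: 1/2
```

For any assignment consistent with the action, `p_after_do` equals `p_joint` divided by the removed factor P(x_j | pa_j), which is the ratio form of Eq. (5.5).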


5.5  Action  Calculus 

The identifiability of causal effects demonstrated in Section 5.3 relies critically
on the two Markovian assumptions given by Eqs. (5.2) and (5.3). If a variable
that has two descendants in the graph is unobserved, the disturbances in the
equations for those two descendants are no longer independent, the Markovian
assumption underlying Eq. (5.3) is violated, and identifiability may be destroyed.
This can be seen easily from Eq. (5.5): if any parent of the manipulated variable
$X_j$ is unobserved, one cannot estimate the conditional probability
$P(x_j \mid pa_j)$, and the effect of the action $do(X_j = x'_j)$ may not be
predictable from the observed distribution $P(x_1, \ldots, x_n)$. Fortunately,
certain causal effects are identifiable even in situations where members of $pa_j$
are unobservable, and these situations can be recognized through the action
calculus introduced in [Pea94].

Let  X,  F,  and  Z  be  arbitrary  disjoint  sets  of  nodes  in  a  Directed  Acyclic 
Graph  (DAG)  G.  We  denote  by  Gy  the  graph  obtained  by  deleting  from  G  all 
arrows  pointing  to  nodes  in  X.  Likewise,  we  denote  by  the  graph  obtained  by 
deleting  from  G  all  arrows  emerging  from  nodes  in  X.  To  represent  the  deletion 
of  both  incoming  and  outgoing  arrows,  we  use  the  notation  Gyz-  Finally,  the 
expression  P{y\x,  z)  ^  P{y,  ^|x)/P(^|:r)  stands  for  the  probability  of  F  =  j/  given 
that  Z  =  z  IS  observed  and  X  is  held  constant  at  x. 

If $G$ is a DAG, let $G_{\overline{X}}$ stand for the subgraph of $G$ with all the arcs pointing
into variables in $X$ removed, and $G_{\underline{X}}$ stand for the subgraph of $G$ with all the
arcs emanating from $X$ removed. Likewise, let $(Y \perp\!\!\!\perp X \mid Z)_G$ stand for $Y$ being
d-separated from $X$ by $Z$ in the graph $G$ [Pea88].
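The two graph surgeries defined above can be sketched in a few lines. This is a minimal illustration, assuming a DAG is stored as a set of (parent, child) pairs; the function names `cut_incoming` and `cut_outgoing` and the example graph are hypothetical and do not come from the report.

```python
# Assumed representation: a DAG is a set of (parent, child) edges.

def cut_incoming(edges, nodes):
    """Delete every arrow pointing into `nodes` (the graph G-overline-X)."""
    nodes = set(nodes)
    return {(a, b) for (a, b) in edges if b not in nodes}


def cut_outgoing(edges, nodes):
    """Delete every arrow emerging from `nodes` (the graph G-underline-X)."""
    nodes = set(nodes)
    return {(a, b) for (a, b) in edges if a not in nodes}


# Hypothetical example graph: U -> X, U -> Y, X -> Z, Z -> Y.
G = {("U", "X"), ("U", "Y"), ("X", "Z"), ("Z", "Y")}

g_bar_x = cut_incoming(G, {"X"})                # arrows into X removed
g_under_x = cut_outgoing(G, {"X"})              # arrows out of X removed
g_bar_x_under_z = cut_outgoing(g_bar_x, {"Z"})  # both surgeries combined

print(sorted(g_bar_x))
print(sorted(g_under_x))
print(sorted(g_bar_x_under_z))
```

Because the two surgeries touch disjoint sets of arrows whenever $X$ and $Z$ are disjoint, the order in which they are composed does not matter.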

The following theorem states the three basic inference rules used in the chapter.

Theorem 11 Let $G$ be the directed acyclic graph associated with a causal model,
and let $P(\cdot)$ stand for the probability distribution induced by that model. For any
disjoint subsets of variables $X$, $Y$, $Z$, and $W$ we have:

Rule 1 (Insertion/deletion of observations)

$P(y \mid \hat{x}, z, w) = P(y \mid \hat{x}, w)$ if $(Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}}}$ (5.6)

Rule 2 (Action/observation exchange)

$P(y \mid \hat{x}, \hat{z}, w) = P(y \mid \hat{x}, z, w)$ if $(Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}\underline{Z}}}$ (5.7)

Rule 3 (Insertion/deletion of actions)

$P(y \mid \hat{x}, \hat{z}, w) = P(y \mid \hat{x}, w)$ if $(Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}\,\overline{Z(W)}}}$

where $Z(W)$ is the set of $Z$-nodes that are not ancestors of any $W$-node in
$G_{\overline{X}}$.
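The side conditions of the three rules are d-separation statements in surgically altered graphs, so they can be checked mechanically. The sketch below is one possible way to do so, not the report's procedure: it uses the standard ancestral-moral-graph test for d-separation, and every name in it (`ancestors`, `d_separated`, `cut_incoming`, `cut_outgoing`, the example graph) is an assumption made for illustration.

```python
# Assumed representation: a DAG is a set of (parent, child) edges.

def ancestors(edges, nodes):
    """Return `nodes` together with all of their ancestors."""
    result = set(nodes)
    frontier = list(result)
    while frontier:
        node = frontier.pop()
        for parent, child in edges:
            if child == node and parent not in result:
                result.add(parent)
                frontier.append(parent)
    return result


def d_separated(edges, xs, ys, zs):
    """True if zs d-separates xs from ys (the three sets are assumed disjoint)."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(edges, xs | ys | zs)
    sub = {(a, b) for a, b in edges if a in keep and b in keep}
    # Moralize the ancestral subgraph: drop directions, marry co-parents.
    links = {frozenset(e) for e in sub}
    for child in keep:
        parents = [a for a, b in sub if b == child]
        links |= {frozenset((p, q)) for p in parents for q in parents if p != q}
    # Search for an undirected path from xs to ys that avoids zs.
    seen = set(xs)
    frontier = list(xs)
    while frontier:
        node = frontier.pop()
        for link in links:
            if node in link:
                (other,) = link - {node}
                if other in ys:
                    return False
                if other not in zs and other not in seen:
                    seen.add(other)
                    frontier.append(other)
    return True


def cut_incoming(edges, nodes):
    return {(a, b) for a, b in edges if b not in set(nodes)}


def cut_outgoing(edges, nodes):
    return {(a, b) for a, b in edges if a not in set(nodes)}


# Hypothetical graph: U -> X, U -> Y, X -> Z, Z -> Y.
G = {("U", "X"), ("U", "Y"), ("X", "Z"), ("Z", "Y")}

# Rule 2 side condition for exchanging z-hat with z, given x-hat:
# (Y indep Z | X) in G with arrows into X and out of Z removed.
rule2_ok = d_separated(cut_outgoing(cut_incoming(G, {"X"}), {"Z"}), {"Y"}, {"Z"}, {"X"})

# Rule 2 side condition for removing the hat from X with nothing else given:
# (Y indep X) in G with arrows out of X removed. U confounds, so this fails.
x_exchange_ok = d_separated(cut_outgoing(G, {"X"}), {"Y"}, {"X"}, set())

print(rule2_ok, x_exchange_ok)
```

In this example the first check succeeds and the second fails, which matches the intuition that $U$ opens a back-door path between $X$ and $Y$ while leaving the $Z$-to-$Y$ exchange intact once $X$ is held fixed.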

Each of these inference rules follows from the basic interpretation of the "$\hat{x}$"
operator as the replacement of the causal mechanism that connects $X$ to its
pre-action parents by a new mechanism $X = x$ introduced by the intervening force.
The result is a submodel characterized by the subgraph $G_{\overline{X}}$ (named manipulated
graph in [SGS93]) that supports all three rules.

Rule 1 reaffirms d-separation as a valid test for conditional independence in
the distribution resulting from the intervention $set(X = x)$, hence the graph $G_{\overline{X}}$.
This rule follows from the fact that deleting equations from the system does not
introduce any dependencies among the remaining disturbance terms.

Rule 2 provides a condition for an external intervention $set(Z = z)$ to have
the same effect on $Y$ as the passive observation $Z = z$. The condition amounts
to $\{X \cup W\}$ blocking all back-door paths from $Z$ to $Y$ (in $G_{\overline{X}}$), since $G_{\overline{X}\underline{Z}}$ retains
all (and only) such paths.

Rule 3 provides conditions for introducing (or deleting) an external intervention
$set(Z = z)$ without affecting the probability of $Y = y$. The validity of this
rule stems, again, from simulating the intervention $set(Z = z)$ by the deletion of
all equations corresponding to the variables in $Z$ (hence the graph $G_{\overline{X}\,\overline{Z}}$).



Corollary 2 A causal effect $q = P(y_1, \ldots, y_k \mid \hat{x}_1, \ldots, \hat{x}_m)$ is identifiable in a model
characterized by a graph $G$ if there exists a finite sequence of transformations,
each conforming to one of the inference rules in Theorem 11, which reduces $q$ to
a standard (i.e., hat-free) probability expression involving observed quantities.
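As a worked illustration of such a reducing sequence, consider a hypothetical three-node graph (not taken from the report) with arrows $Z \to X$, $Z \to Y$, and $X \to Y$, all variables observed. Under that assumption, two rule applications remove the hat:

```latex
\begin{align*}
P(y \mid \hat{x})
  &= \sum_{z} P(y \mid \hat{x}, z)\, P(z \mid \hat{x})
     && \text{(probability calculus)} \\
  &= \sum_{z} P(y \mid x, z)\, P(z \mid \hat{x})
     && \text{(Rule 2, since } (Y \perp\!\!\!\perp X \mid Z)_{G_{\underline{X}}}\text{)} \\
  &= \sum_{z} P(y \mid x, z)\, P(z)
     && \text{(Rule 3, since } (Z \perp\!\!\!\perp X)_{G_{\overline{X}}}\text{)}
\end{align*}
```

The Rule 2 step holds because, once the arrow $X \to Y$ is deleted, the only remaining path between $X$ and $Y$ runs through $Z$, which is conditioned on. The Rule 3 step holds because, once the arrow $Z \to X$ is deleted, the only path between $Z$ and $X$ passes through the head-to-head junction at $Y$. The final expression is hat-free, so the effect is identifiable in this graph.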

Although Theorem 11 and Corollary 2 require that the Markovian assumptions
hold, they can be applied to recursive non-Markovian models, because such
models become Markovian if we consider the unobserved variables as part of the
analysis and represent these variables as nodes in the graph.

5.6  A  Graphical  Criterion  for  Testing  Identifiability 

To avoid excessive notation, for the rest of this chapter, we will consistently
refer to queries $P(y \mid \hat{x})$ that satisfy Corollary 2 as "identifiable," with the
understanding that Corollary 2 represents a sufficient but not (yet) necessary condition
for semantical identifiability, given in Definition 23. The two notions would be
equivalent if the rules in Theorem 11 were complete.

Theorem 12 A necessary and sufficient condition for the identifiability of
$P(y \mid \hat{x})$ in a graph $G$ is that $G$ satisfies one of the following four conditions:

1. There is no back-door path from $X$ to $Y$ in $G$, that is, $(X \perp\!\!\!\perp Y)_{G_{\underline{X}}}$.

2. There is no directed path from $X$ to $Y$ in $G$.

3. There exists a set of nodes $B$ that blocks all back-door paths from $X$ to $Y$
such that $P(b \mid \hat{x})$ is identifiable. A special case of this condition occurs when
$B$ consists entirely of nondescendants of $X$, in which case $P(b \mid \hat{x})$ reduces
immediately to $P(b)$.

4. There exist sets of nodes $Z_1$ and $Z_2$ such that

(i) $Z_1$ blocks every directed path from $X$ to $Y$,
i.e., $(Y \perp\!\!\!\perp X \mid Z_1)_{G_{\overline{Z_1}\,\overline{X}}}$,

(ii) $Z_2$ blocks all back-door paths between $Z_1$ and $Y$,
i.e., $(Y \perp\!\!\!\perp Z_1 \mid Z_2)_{G_{\overline{X}\underline{Z_1}}}$,

(iii) $Z_2$ blocks all back-door paths between $X$ and $Z_1$,
i.e., $(X \perp\!\!\!\perp Z_1 \mid Z_2)_{G_{\underline{X}}}$,

(iv) $Z_2$ does not conduct any back-door paths from $X$ to $Y$,
i.e., $(X \perp\!\!\!\perp Y \mid Z_1, Z_2)_{G_{\overline{Z_1}\,\overline{X(Z_2)}}}$. This condition holds if the conditions
(i)-(iii) above are met and no member of $Z_2$ is a descendant of $X$.

A special case of Condition 4 occurs when $Z_2 = \emptyset$ and there is no back-door
path from $X$ to $Z_1$ or from $Z_1$ to $Y$.
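Conditions 1 and 2 need no search and can be tested directly from ancestor sets. The following is a minimal sketch under several assumptions that are not in the report: $X$ and $Y$ are single nodes, the DAG is a set of (parent, child) pairs, and "no back-door path" is read as the d-separation statement $(X \perp\!\!\!\perp Y)_{G_{\underline{X}}}$ with an empty conditioning set. With nothing conditioned on, two nodes are d-connected exactly when they share an ancestor (counting each node as its own ancestor), which is what the code checks. All function names are hypothetical.

```python
# Assumed representation: a DAG is a set of (parent, child) edges.

def ancestors_or_self(edges, node):
    """Return `node` together with all of its ancestors."""
    found = {node}
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for parent, child in edges:
            if child == current and parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


def has_directed_path(edges, x, y):
    """Condition 2 fails exactly when this returns True."""
    return x != y and x in ancestors_or_self(edges, y)


def has_back_door_path(edges, x, y):
    """Condition 1 fails exactly when this returns True.

    Arrows leaving x are cut first; x and y are then d-connected by the
    empty set iff they have a common ancestor in the cut graph.
    """
    cut = {(a, b) for a, b in edges if a != x}
    return bool(ancestors_or_self(cut, x) & ancestors_or_self(cut, y))


def passes_condition_1_or_2(edges, x, y):
    return (not has_back_door_path(edges, x, y)) or (not has_directed_path(edges, x, y))


# Hypothetical examples.
chain = {("X", "Y")}                                   # no confounding
confounded = {("U", "X"), ("U", "Y"), ("X", "Y")}      # U opens a back door
no_effect = {("U", "X"), ("U", "Y")}                   # X does not reach Y

print(passes_condition_1_or_2(chain, "X", "Y"))
print(passes_condition_1_or_2(confounded, "X", "Y"))
print(passes_condition_1_or_2(no_effect, "X", "Y"))
```

A graph that fails both checks, such as `confounded` above, is not necessarily unidentifiable; it simply has to be examined under Conditions 3 and 4, which depend on which nodes are observed.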

Proof (of Theorem 12):

We prove the sufficiency of Conditions (1)-(4) above, then turn to proving their
necessity.

Condition 1: This condition follows directly from Rule 2. If $(Y \perp\!\!\!\perp X)_{G_{\underline{X}}}$,
then we can immediately change $P(y \mid \hat{x})$ to $P(y \mid x)$, so the query is identifiable.

Condition 2: If there is no directed path from $X$ to $Y$ in $G$, then
$(Y \perp\!\!\!\perp X)_{G_{\overline{X}}}$. So, by Rule 3, $P(y \mid \hat{x}) = P(y)$, and the query is identifiable.

Condition 3: If there is a set of nodes $B$ that blocks all back-door paths from
$X$ to $Y$, then we can rewrite $P(y \mid \hat{x})$ as $\sum_b P(y \mid \hat{x}, b) P(b \mid \hat{x})$. Since $B$ blocks all
back-door paths from $X$ to $Y$, it must be the case that $(Y \perp\!\!\!\perp X \mid B)_{G_{\underline{X}}}$, and thus,
by Rule 2, we can rewrite $P(y \mid \hat{x}, b)$ as $P(y \mid x, b)$. If the query $P(b \mid \hat{x})$ is identifiable,
then the original query must also be identifiable. See examples in Figure 5.1.


[Figure 5.1, panels (a) and (b): graph drawings not recoverable from the source.]

Figure 5.1: Illustrating Condition 3 of Theorem 12. In (a), the set $\{B_1, B_2\}$ blocks
all back-door paths from $X$ to $Y$ and $P(b_1, b_2 \mid \hat{x}) = P(b_1, b_2)$. In (b), the node $B$
blocks all back-door paths from $X$ to $Y$, and $P(b \mid \hat{x})$ is identifiable using Condition 4.

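Condition 4: If there is a set of nodes $Z_1$ that blocks all directed paths from
$X$ to $Y$ and a set of nodes $Z_2$ that blocks all back-door paths between $Y$ and $Z_1$
in $G_{\overline{X}}$, then we expand $P(y \mid \hat{x}) = \sum_{z_1, z_2} P(y \mid \hat{x}, z_1, z_2) P(z_1, z_2 \mid \hat{x})$. We can rewrite
$P(y \mid \hat{x}, z_1, z_2)$ as $P(y \mid \hat{x}, \hat{z}_1, z_2)$ using Rule 2, since all back-door paths between $Z_1$
and $Y$ are blocked by $Z_2$ in $G_{\overline{X}}$. We can reduce $P(y \mid \hat{x}, \hat{z}_1, z_2)$ to $P(y \mid \hat{z}_1, z_2)$
using Rule 3, since $(Y \perp\!\!\!\perp X \mid Z_1, Z_2)_{G_{\overline{Z_1}\,\overline{X(Z_2)}}}$. We can rewrite $P(y \mid \hat{z}_1, z_2)$ as
$P(y \mid z_1, z_2)$ if $(Y \perp\!\!\!\perp Z_1 \mid Z_2)_{G_{\underline{Z_1}}}$. The only way that this independence cannot
hold is if there is a path from $Y$ to $Z_1$ through $X$, since $(Y \perp\!\!\!\perp Z_1 \mid Z_2)_{G_{\overline{X}\underline{Z_1}}}$.
However, we can block this path by conditioning and summing over $X$, to get
$\sum_{x'} P(y \mid \hat{z}_1, z_2, x') P(x' \mid \hat{z}_1, z_2)$. Now we can rewrite $P(y \mid \hat{z}_1, z_2, x')$ as $P(y \mid z_1, z_2, x')$
using Rule 2. $P(x' \mid \hat{z}_1, z_2)$ can be rewritten as $P(x' \mid z_2)$ using Rule 3, since
$Z_1$ is a child of $X$ and the graph is acyclic. So, the query can be rewritten as
$\sum_{z_1, z_2} \sum_{x'} P(y \mid z_1, z_2, x') P(x' \mid z_2) P(z_1, z_2 \mid \hat{x})$, where $P(z_1, z_2 \mid \hat{x}) = P(z_2 \mid \hat{x}) P(z_1 \mid \hat{x}, z_2)$.
Since $Z_2$ consists of nondescendants of $X$, we can rewrite $P(z_2 \mid \hat{x})$ as $P(z_2)$ using
Rule 3. Since $Z_2$ blocks all back-door paths from $X$ to $Z_1$, we can rewrite
$P(z_1 \mid \hat{x}, z_2)$ as $P(z_1 \mid x, z_2)$ using Rule 2. The entire query can thus be rewritten
as $\sum_{z_1, z_2} \sum_{x'} P(y \mid z_1, z_2, x') P(x' \mid z_2) P(z_1 \mid x, z_2) P(z_2)$. See examples in Figure 5.2.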
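The closing expression of the Condition 4 argument can be checked numerically in its simplest special case, $Z_2 = \emptyset$, where it becomes $\sum_{z} P(z \mid x) \sum_{x'} P(y \mid z, x') P(x')$. The sketch below assumes a hypothetical binary model with a hidden confounder $U$ and arrows $U \to X$, $U \to Y$, $X \to Z$, $Z \to Y$. All probabilities are invented purely for illustration and are not results from the report. The code compares the effect computed from the full model (which uses $U$) with the value obtained from the observed distribution of $X$, $Z$, $Y$ alone.

```python
from itertools import product

# Invented parameters of a hypothetical model (U is unobserved).
P_U = {0: 0.6, 1: 0.4}
P_X1_GIVEN_U = {0: 0.2, 1: 0.7}
P_Z1_GIVEN_X = {0: 0.1, 1: 0.8}
P_Y1_GIVEN_ZU = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}


def bern(p_one, value):
    """Probability that a binary variable with P(1) = p_one equals `value`."""
    return p_one if value == 1 else 1.0 - p_one


def joint(u, x, z, y):
    return (P_U[u] * bern(P_X1_GIVEN_U[u], x)
            * bern(P_Z1_GIVEN_X[x], z) * bern(P_Y1_GIVEN_ZU[(z, u)], y))


def observed(x=None, z=None, y=None):
    """Marginal probability over the observed variables that are fixed."""
    total = 0.0
    for u, xv, zv, yv in product((0, 1), repeat=4):
        if ((x is None or xv == x) and (z is None or zv == z)
                and (y is None or yv == y)):
            total += joint(u, xv, zv, yv)
    return total


def effect_from_model(y, x):
    """P(y | x-hat) computed with access to U: replace X's mechanism by X = x."""
    return sum(P_U[u] * bern(P_Z1_GIVEN_X[x], z) * bern(P_Y1_GIVEN_ZU[(z, u)], y)
               for u in (0, 1) for z in (0, 1))


def effect_from_condition_4(y, x):
    """Hat-free expression with Z2 empty, using observed quantities only."""
    total = 0.0
    for z in (0, 1):
        p_z_given_x = observed(x=x, z=z) / observed(x=x)
        inner = sum(observed(x=xp, z=z, y=y) / observed(x=xp, z=z) * observed(x=xp)
                    for xp in (0, 1))
        total += p_z_given_x * inner
    return total


for x in (0, 1):
    print(x, effect_from_model(1, x), effect_from_condition_4(1, x))
```

The two columns agree up to floating-point error, while the naive conditional $P(y \mid x)$ generally differs from both because of the confounder.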


Figure 5.2: Illustrating Condition 4 of Theorem 12. (a) $Z_1$ blocks all directed
paths from $X$ to $Y$, and the empty set blocks all back-door paths from $Z_1$ to $Y$
in $G_{\overline{X}}$ and all back-door paths from $X$ to $Z_1$ in $G$; (b, c) $Z_1$ blocks all directed
paths from $X$ to $Y$, and $Z_2$ blocks all back-door paths from $Z_1$ to $Y$ in $G_{\overline{X}}$ and
all back-door paths from $X$ to $Z_1$ in $G$.

We now prove that the conditions of Theorem 12 are necessary. This may be
shown by contradiction.

Proof Sketch: We will assume that there exists a query $P(y \mid \hat{x})$ and a graph
$G$ such that (1) none of the conditions of Theorem 12 holds, and (2) there exists
a finite sequence of applications of inference rules which removes all hats from
the variables in the query. We will show that these two assumptions lead to a
contradiction; hence, if all four conditions of Theorem 12 fail, there must not be a
finite sequence of inference rules that reduces the query to a hat-free expression.

Proof Outline:

I $(Y \perp\!\!\!\perp X \mid Z, W)_{G_{\overline{Z}\underline{X}}}$, so Rule 2 can be applied to remove the hat from $X$.

  A There is a directed path from $Z$ to $Y$

    1 Cannot add $\hat{z}$ using Rule 3

    2 Cannot add $\hat{z}$ using Rule 2

  B There is a directed path from $Z$ to $X$

    1 Cannot remove $\hat{z}$ using Rule 2

    2 Cannot remove $\hat{z}$ using Rule 3

II $(Y \perp\!\!\!\perp X \mid Z, W)_{G_{\overline{Z}\,\overline{X(W)}}}$, so Rule 3 can be applied to remove $\hat{x}$

  A Cannot add $\hat{z}$ using Rule 3

  B Cannot add $\hat{z}$ using Rule 2

Assume that there exists a query $P(y \mid \hat{x})$ and a graph $G$ such that none of the
conditions of Theorem 12 holds, but the query is still identifiable. Since $P(y \mid \hat{x})$ is
identifiable, there must be some finite sequence of inference rules that removes the
hat from $X$. This means that there must be some (possibly empty) sets of variables
$Z$ and $W$ such that either $(Y \perp\!\!\!\perp X \mid Z, W)_{G_{\overline{Z}\underline{X}}}$, so we can reduce $P(y \mid \hat{x}, \hat{z}, w)$ to
$P(y \mid x, \hat{z}, w)$ via Rule 2, or $(Y \perp\!\!\!\perp X \mid Z, W)_{G_{\overline{Z}\,\overline{X(W)}}}$, so we can reduce $P(y \mid \hat{x}, \hat{z}, w)$
to $P(y \mid \hat{z}, w)$ using Rule 3. We will look at each of these cases in turn.

Case I: Consider $(Y \perp\!\!\!\perp X \mid Z, W)_{G_{\overline{Z}\underline{X}}}$. By assumption, $P(y \mid \hat{x})$ is identifiable,
and the hat is removed from $X$ by an application of Rule 2. This implies
a series of rule applications to $P(y \mid \hat{x})$ which results in $P(y \mid \hat{x}, \hat{z}, w)$ such that
$(Y \perp\!\!\!\perp X \mid Z, W)_{G_{\overline{Z}\underline{X}}}$. We will look at the restrictions imposed on $Z$ and $W$ by
both the failure of the conditions of Theorem 12 to hold and the assumption that
$P(y \mid \hat{x})$ can be transformed to $P(y \mid \hat{x}, \hat{z}, w)$ by a series of rule applications. We
will also make the assumption that $Z$ and $W$ are minimal. If they are not, then
there exist minimal $Z'$ and $W'$, in which superfluous nodes are removed, that
would work. Thus, proving that no minimal $Z'$ and $W'$ exist implies that no $Z$
and $W$ exist.

If $(Y \perp\!\!\!\perp X \mid Z, W)_{G_{\underline{X}}}$ held, then a blockable back-door path would exist, and
Condition 3 of Theorem 12 would have held; hence this independence must fail.
We also know $(Y \perp\!\!\!\perp X \mid Z, W)_{G_{\overline{Z}\underline{X}}}$, by
assumption. Together, these two assertions imply that $Z$ conducts a back-door
path that is not blocked by $W$. That is, there is a back-door path between
$X$ and $Y$ that has a head-to-head junction in $Z$. Each element of $Z$ must also
block a back-door path from $X$ to $Y$, since $Z$ is minimal. This implies that there
is a directed path from $Z$ to $X$ or from $Z$ to $Y$ (Figure 5.3).

Proof that there is a directed path from $Z$ to $X$ or from $Z$ to $Y$: Since we
know that $Z$ must block a back-door path from $X$ to $Y$, there must be a path
from $Z$ to $X$ or from $Z$ to $Y$ that starts in an arrow that is incident away from
$Z$. All of the head-to-head junctions along this path must either be in $W$ or have
descendants in $W$. If there are no such head-to-head junctions, then there is
a directed path from $Z$ to $X$ or from $Z$ to $Y$. If there is a head-to-head junction,
then consider the member of $W$ that unblocks this junction. This member must itself
block a back-door path from $X$ to $Y$, so there must be a path from it to either $X$ or
$Y$ that starts with an arc incident away from it. This path is either a directed
path from $W$ to $X$ or from $W$ to $Y$, or has a head-to-head junction that is also
a member of $W$ or an ancestor of a member of $W$. Since the graph is acyclic,
there must eventually be a member of $W$ that has a directed path to $X$ or $Y$ and
that is also a descendant of $Z$. Thus there is a directed path from $Z$ to either $X$ or $Y$.

We now look at the subcases of Case I.

Case IA. A directed path exists from $Z$ to $Y$. By our assumption, there
must be a sequence of rules that transforms $P(y \mid \hat{x})$ to $P(y \mid \hat{x}, \hat{z}, w)$. There are
two ways of adding $\hat{z}$ to $P(y \mid \hat{x})$: by using Rule 3, or by first conditioning on $Z$
and then adding the hat to it by using Rule 2.

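Case IA1: First we look at using Rule 3. If there is a directed path from
$Z$ to $Y$ (Figure 5.3a), then $(Y \not\perp\!\!\!\perp Z \mid X)_{G_{\overline{X}\,\overline{Z}}}$. No element of $W'$ can block this
path from $Y$ to $Z$, since that would require $W'$ to be a descendant of $Z$, and
then $(Y \not\perp\!\!\!\perp Z \mid X, W')_{G_{\overline{X}\,\overline{Z(W')}}}$. So Rule 3 cannot be invoked to add $\hat{z}$ to $P(y \mid \hat{x})$.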

Case IA2: We need to condition on $Z$ and then add the hat to it using Rule 2.
In order for us to add the hat to $Z$ using Rule 2, there needs to be a $W'$ such that
$(Y \perp\!\!\!\perp Z \mid W', X)_{G_{\overline{X}\underline{Z}}}$. Above, we proved that given our assumptions, there must
be an unblocked path from $Y$ to $X$ that has a head-to-head junction at $Z$ and
that no member of $W$ blocks this path, so $W' \not\subseteq W$. If we condition on a $W'$ that
allows us to add the hat to $Z$, we must then remove this $W'$ to obtain $P(y \mid \hat{x}, \hat{z}, w)$
so that we can remove the hat from $X$. However, we are not able to remove this
$W'$. We cannot remove $W'$ using Rule 1, since $(Y \not\perp\!\!\!\perp W' \mid X, Z, W)_{G_{\overline{X}\,\overline{Z}}}$, and if we
add some $W''$ that d-separates $Y$ from $W'$, then we would not be able to remove
$W''$. Thus, we cannot add $\hat{z}$ to $P(y \mid \hat{x})$ by first conditioning on $Z$ and then adding
the hat to it by using Rule 2 if there is a directed path from $Z$ to $Y$.

Case IB. A directed path exists from $Z$ to $X$. If there is a directed path
from $Z$ to $X$ (Figure 5.3b), we can assume that we can add $\hat{z}$ to $P(y \mid \hat{x})$ to get
$P(y \mid \hat{x}, \hat{z})$, condition on $W$ to get $P(y \mid \hat{x}, \hat{z}, w)$, and then use Rule 2 to remove the
hat from $X$.

Now we will prove that there is no way to remove $\hat{z}$ from the expression
$P(y \mid x, \hat{z}, w)$. Since there is a back-door path from $X$ to $Y$ that has a head-to-head
junction at $Z$, there must be a back-door path from $Z$ to $Y$.



Case IB1: If we could remove the hat from $Z$ using Rule 2, then we could
block the back-door path from $Z$ to $Y$ and, hence, block the back-door path from
$X$ to $Y$, and Condition 3 would have held.

Case IB2: If we could remove $\hat{z}$ directly using Rule 3, then there would have
to be some set of nodes that blocked the directed path from $Z$ to $X$, and
$(Y \perp\!\!\!\perp X \mid Z, W)_{G_{\overline{Z}\underline{X}}}$ and $(Y \not\perp\!\!\!\perp X \mid Z, W)_{G_{\underline{X}}}$ could not both be true.

Thus, we cannot remove all the hats from the expression by using Rule 2 to
remove the hat from $X$.


Figure 5.3: Using Rule 2 to remove the hat from $X$ when the criterion fails: since
$Z$ is necessary, there must be a directed path from (a) $Z$ to $Y$ or (b) $Z$ to $X$.

Case II: Now consider $(Y \perp\!\!\!\perp X \mid Z, W)_{G_{\overline{Z}\,\overline{X(W)}}}$. We will try to find a set
of rule applications that transforms $P(y \mid \hat{x})$ into $P(y \mid \hat{x}, \hat{z}, w)$ when none of the
conditions of Theorem 12 holds. $Z$ must block all directed paths from $X$ to $Y$.
If it did not, then $W$ would have to block a directed path, which would make $W$
a descendant of $X$, so then $X(W) = \emptyset$, and thus $(Y \perp\!\!\!\perp X \mid Z, W)_{G_{\overline{Z}}}$; however, we
proved above (Case I) that this could not happen when none of the conditions of
Theorem 12 holds. There are two ways of adding $\hat{z}$ to $P(y \mid \hat{x})$: by using Rule 3 directly, or
by conditioning on $Z$ and then adding the hat to it using Rule 2. We will look
at each of these in turn.



Case IIA: First, we will try to add $\hat{z}$ directly by using Rule 3. To do this,
there must be some $W$ such that $(Y \perp\!\!\!\perp Z \mid W, X)_{G_{\overline{X}\,\overline{Z(W)}}}$. Since there is a directed
path from $Z$ to $Y$, $W$ must be a descendant of $Z$ and thus $(Y \perp\!\!\!\perp Z \mid W, X)_{G_{\overline{X}}}$.
So, $W$ blocks all back-door paths between $Z$ and $Y$ in $G_{\overline{X}}$. Once $\hat{x}$ has been
removed from $P(y \mid \hat{x}, \hat{z}, w)$ to obtain $P(y \mid \hat{z}, w)$, we need to remove $\hat{z}$, or remove
the hat from $Z$. We cannot remove $\hat{z}$ directly by using Rule 3, since
$Z(W) = \emptyset$ and thus $(Y \not\perp\!\!\!\perp Z \mid W, X)_{G}$, and there is a back-door path from
$Z$ to $Y$ through $X$. If we could remove the hat from $Z$ by using Rule 2, then
Condition 4 would hold. So, we cannot add $\hat{z}$ directly by using Rule 3 when none of
the conditions of Theorem 12 holds.

Case IIB: Next, we will try to condition on $Z$ and then add the hat to it by
using Rule 2. However, for this to be possible, there would have to be a $W$ that
blocks all back-door paths between $X$ and $Z$, and between $Z$ and $Y$, and then
Condition 4 would hold.

Thus, if none of the conditions of Theorem 12 holds, the query must not be
identifiable.

Remark: The criterion in Theorem 12 is complete only if the inference rules
themselves are complete. We will look at each of the three rules in Theorem 11
and show that the graphical conditions that license each are the tightest possible.

$(Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}}}$ if $P(y \mid \hat{x}, z, w) = P(y \mid \hat{x}, w)$

Since the d-separation condition is valid for any recursive model, including
the submodel represented by $G_{\overline{X}}$, the conditional independence $P(y \mid \hat{x}, z, w) =
P(y \mid \hat{x}, w)$ implies $(Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}}}$.



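$(Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}\underline{Z}}}$ if $P(y \mid \hat{x}, \hat{z}, w) = P(y \mid \hat{x}, z, w)$

Consider the augmented diagram $G'$ that has the intervention arcs $F_Z \to Z$
added. $P(y \mid \hat{x}, \hat{z}, w) = P(y \mid \hat{x}, z, w)$ implies that $(Y \perp\!\!\!\perp F_Z \mid X, Z, W)_{G'_{\overline{X}}}$. If there
is a path from $Y$ to $Z$ that is unblocked by $\{X, W\}$ in $G_{\overline{X}}$, this path must not
end in an arrow pointing into $Z$; if it did, $(Y \perp\!\!\!\perp F_Z \mid X, Z, W)_{G'_{\overline{X}}}$ would not hold.
Since every path from $Y$ to $Z$ that is not blocked by $\{X, W\}$ in $G_{\overline{X}}$ must pass
through an arrow leaving $Z$, $(Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}\underline{Z}}}$.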

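$(Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}\,\overline{Z(W)}}}$ if $P(y \mid \hat{x}, \hat{z}, w) = P(y \mid \hat{x}, w)$

Again consider $G'$ with intervention arcs $F_Z \to Z$ added. $P(y \mid \hat{x}, \hat{z}, w) =
P(y \mid \hat{x}, w)$ implies that $(Y \perp\!\!\!\perp F_Z \mid X, W)_{G'_{\overline{X}}}$. Hence, any path from $Z$ to $Y$ that
is not blocked by $\{X, W\}$ in $G_{\overline{X}}$ must end in an arrow pointing to $Z$; otherwise,
$(Y \perp\!\!\!\perp F_Z \mid X, W)_{G'_{\overline{X}}}$ would not hold. In addition, if there is a path from some
$Z'$ of $Z$ to $Y$ that does end in an arrow pointing to $Z'$, then $W$ must not be a
descendant of $Z'$; otherwise, $(Y \perp\!\!\!\perp F_Z \mid X, W)_{G'_{\overline{X}}}$ would not hold. Thus, the only
paths from $Y$ to $Z$ must end in an arrow pointing at $Z$ and in some member of
$Z(W)$. Thus, $(Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}\,\overline{Z(W)}}}$.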

Although these rules are as tight as possible, some strange exchange of hatted
and hatless variables that is not reachable by successive applications of Rules 1-3
might still be licensed by some graph. Thus, it is possible that the three inference
rules (and hence Theorem 12) are not complete.



5.7  Remarks  on  Efficiency 

In implementing Theorem 12 as a systematic method for determining identifiability,
Conditions 3 and 4 would seem to require exhaustive search. To prove that
Condition 3 does not hold, for instance, we need to prove that no blocking set $B$
can exist. Fortunately, the following theorems allow us to significantly prune the
search space, so as to render the test tractable.

Theorem 13 If for one minimal set $B_i$, $P(b_i \mid \hat{x})$ is identifiable, then for any
other minimal set $B_j$, $P(b_j \mid \hat{x})$ is identifiable.

Theorem 13 allows us to test Condition 3 with a single minimal blocking set
$B$. If $B$ meets the requirements for Condition 3, then the query is identifiable;
otherwise, Condition 3 cannot be satisfied. In proving this theorem, we use the
following lemma.

Lemma 2 If the query $P(y \mid \hat{x})$ is identifiable, and a set of nodes $Z$ lies on a
directed path from $X$ to $Y$, then the query $P(z \mid \hat{x})$ is identifiable.

Theorem 14 Let $Y_1$ and $Y_2$ be two subsets of nodes such that either no nodes
$Y_1$ are descendants of $X$, or all nodes $Y_1$ and $Y_2$ are descendants of $X$ and all
nodes $Y_1$ are nondescendants of $Y_2$. A reducing sequence for $P(y_1, y_2 \mid \hat{x})$ exists (per
Corollary 1) iff there are reducing sequences for both $P(y_1 \mid \hat{x})$ and $P(y_2 \mid \hat{x}, y_1)$.

$P(y_1, y_2 \mid \hat{x})$ may possibly pass the test in Theorem 12 if we apply the procedure
to both $P(y_2 \mid \hat{x}, y_1)$ and $P(y_1 \mid \hat{x})$, but if we try to apply the test to $P(y_1 \mid \hat{x}, y_2)$, we
will not find a reducing sequence of rules. Figure 5.4 shows just such an example.
Theorem 14 guarantees that, if there is a reducing sequence for $P(y_1, y_2 \mid \hat{x})$, then
we should always be able to find such a sequence for both $P(y_1 \mid \hat{x})$ and $P(y_2 \mid \hat{x}, y_1)$
by proper choice of $Y_1$.

Figure 5.4: Theorem 12 ensures a reducing sequence for $P(y_2 \mid \hat{x}, y_1)$ and $P(y_1 \mid \hat{x})$,
although none exists for $P(y_1 \mid \hat{x}, y_2)$.

Theorem 15 If there exists a set $Z_1$ that meets all of the requirements for $Z_1$
in Condition 4, then the set consisting of the children of $X$ intersected with the
ancestors of $Y$ will also meet all of the requirements for $Z_1$ in Condition 4.
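The candidate set named in Theorem 15 can be read off the graph without any search. The sketch below is an illustration only, assuming single nodes $X$ and $Y$ and a DAG stored as a set of (parent, child) pairs; the function names and the example graph are hypothetical.

```python
# Assumed representation: a DAG is a set of (parent, child) edges.

def children(edges, x):
    """Nodes that x points to directly."""
    return {b for a, b in edges if a == x}


def proper_ancestors(edges, y):
    """All ancestors of y, not including y itself."""
    found = set()
    frontier = [y]
    while frontier:
        current = frontier.pop()
        for parent, child in edges:
            if child == current and parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


def candidate_z1(edges, x, y):
    """Children of x intersected with the ancestors of y."""
    return children(edges, x) & proper_ancestors(edges, y)


# Hypothetical graph: U -> X, U -> Y, X -> A -> Y, X -> B (B does not reach Y).
G = {("U", "X"), ("U", "Y"), ("X", "A"), ("A", "Y"), ("X", "B")}

print(candidate_z1(G, "X", "Y"))
```

Only this one candidate then needs to be tested against requirements (i)-(iv) of Condition 4. Note that if $Y$ is itself a child of $X$, the candidate excludes $Y$ and therefore cannot block the direct arrow, so Condition 4 fails at requirement (i) for such a graph.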

Theorem 15 removes the need to search for $Z_1$ in Condition 4 of Theorem 12.
We now provide proofs for Theorems 13-15.

Proof (of Theorem 13):

(By contradiction.)

Assume that there is a minimal set $B$ such that $(Y \perp\!\!\!\perp X \mid B)_{G_{\underline{X}}}$ and the query
$P(b \mid \hat{x})$ is identifiable. Assume that there is another minimal set $K$ such that
$(Y \perp\!\!\!\perp X \mid K)_{G_{\underline{X}}}$ and the query $P(k \mid \hat{x})$ is not identifiable.

Consider all (undirected) paths from $X$ to $Y$ in $G_{\underline{X}}$. Every element of $B$ and
$K$ must lie along one of these paths, since the sets are minimal. In addition, at
least one member of $K$ must be a descendant of $X$, otherwise $P(k \mid \hat{x})$ would be
identifiable. In fact, any member of $K$ that is a descendant of $X$ needs to lie on
a directed path from $X$ to $Y$. To see that this is true, note that if a member
$K_1$ of $K$ is a descendant of $X$ but does not lie on a directed path from $X$ to
$Y$, then there must be a head-to-head junction along the path from $K_1$ to $Y$.
This path would have to be unblocked by some other member $K_2$ of $K$. Since
$K$ is minimal, there must be some unblocked path from some descendant of $K_2$
to $Y$ that $K$ blocks. This implies that there is either a directed path from one
of the descendants of $K_2$ to $Y$, which would make $K_1$ an ancestor of $Y$, or a
head-to-head junction on the path from $K_2$ to $Y$ that is unblocked by some other
member $K_3$ of $K$. Namely, there is either an infinite series of $K$s between $K_1$ and
$Y$, or else a directed path from $K_1$ to $Y$ (see Figure 5.5).


Figure 5.5: If a member of $K$ blocks a back-door path from $X$ to $Y$ and is a
descendant of $X$, then it is also an ancestor of $Y$.

Let $K'$ be the subset of $K$ that lies on a directed path from $X$ to $Y$, and let
$K'' = K \setminus K'$. We know that $P(k \mid \hat{x}) = P(k' \mid \hat{x}, k'') * P(k'' \mid \hat{x})$ and that $P(k'' \mid \hat{x}) =
P(k'')$. So, $P(k' \mid \hat{x}, k'')$ must not be identifiable. Since $K$ is minimal, $K'$ must
block some back-door path, and that back-door path must also be blocked by
some member $B'$ of $B$. There are two possibilities: either the path that $K'$
blocks has a head-to-head junction that is not unblocked by $B$ or there is some
member $B'$ of $B$ which blocks the same back-door path. These two cases are
illustrated in Figure 5.6.



Figure  5.6:  Examples  of  the  two  cases  for  K' 

Case 1: There is a head-to-head junction that is not unblocked in $B$, but is
unblocked in $K$. Call this junction $J$. Since $K$ is minimal, the element of $K$ that
unblocks this path (equal to either $J$ or one of $J$'s descendants) must lie on some
unblocked path from $Y$ to $X$ in $G_{\underline{X}}$. If this is the case, then there must be an
unblocked path through $J$'s descendants that also goes through $J$, which means
there must be some element $B'$ of $B$ that blocks the path between $J$ and $X$ in
$G_{\underline{X}}$ (see Figure 5.7). We can condition and sum over this $B'$ to get

$P(k' \mid \hat{x}, k'') = \sum_{b'} P(k' \mid \hat{x}, k'', b') * P(b' \mid \hat{x}, k'')$

$= \sum_{b'} P(k' \mid x, k'', b') * P(b' \mid \hat{x}, k'')$

by using Rule 2. So, the query $P(b' \mid \hat{x}, k'')$ must not be identifiable. Thus, $B'$
must be a descendant of $X$, because otherwise $P(b' \mid \hat{x}, k'') = P(b' \mid k'')$. So, $P(b' \mid \hat{x})$
is identifiable, but $P(b' \mid \hat{x}, k'')$ is not. Therefore, $K''$ must disallow the blocking
of a back-door path from $X$ to $B'$. As a result, there must be a back-door path
from $X$ to $B'$ that has a head-to-head junction, and this junction must have a
descendant in $K''$ but not in $B$. This is impossible: since $K$ is minimal, the
descendant of the head-to-head junction must block a back-door path from $X$ to
$Y$. $B$ must block that same path, meaning the path from $X$ to $B'$ was unblocked
by $B$ as well as by $K''$.

Figure 5.7: There must exist a member $B'$ of $B$ which blocks the back-door path
from $X$ to $J$.

Case 2: There is a member $B'$ of $B$ that blocks the same back-door path as
$K'$. The path could be blocked by $B'$ either between $X$ and $K'$, or between $K'$
and $Y$ (see Figure 5.8). If the path is blocked by $B'$ between $X$ and $K'$, we have
the same contradiction as in Case 1 above. If it is blocked by $B'$ between $K'$
and $Y$, then $B'$ lies on a directed path from $X$ to $Y$. From Lemma 2, we know
that $P(k' \mid \hat{x})$ must be identifiable. That means that $K''$ must disallow either
Condition 3 or Condition 4 of Theorem 12. If it blocks Condition 3, then $K''$
must conduct a back-door path from $X$ to $K'$. Namely, some member of $K''$ is
at or is a descendant of a head-to-head junction along a path from $K'$ to $X$ in
$G_{\underline{X}}$. Using the same argument as above, since $K$ is minimal, the path blocked
by $K$ must also be blocked by $B$, and thus the head-to-head junction must be
unblocked by $B$ as well. Any unblockable back-door path from $X$ to $K'$ will also
be an unblockable back-door path from $X$ to $B'$, since $B'$ is a direct descendant
of $K'$. However, we know that there cannot be a back-door path from $X$ to $B'$
that is unblockable when we condition on $B$. Thus there cannot be a back-door
path from $X$ to $K'$ that is unblockable when we condition on $K''$.


115 


(a) 


X 


B’o 


\ 

/ 


Figure  5.8:  B  can  be  between  either  (a)  X  and  K',  or  (b)  K'  and  Y. 

If $K''$ disallows Condition 4, then some other set of nodes $R$ must block every
directed path from $X$ to $K'$. $K''$ must unblock a back-door path from $X$ to $R$ or
from $R$ to $K'$. As above, if a back-door path from $X$ to $R$ (and thus from $X$ to
$K'$) is unblocked by $K''$, it will also be unblocked by $B$. So, $K''$ must unblock a
back-door path from $R$ to $K'$. Since $K$ is minimal, there must be a path from a
descendant of $R$ to $X$ in $G_{\underline{X}}$, which implies that there must also be a path from
$Y$ to $X$ in $G_{\underline{X}}$ that passes through $R$ and $K'$. Since the back-door path from $X$
to $K'$ must not be blockable (since a blockable back-door path was invalidated
above), $B$ must block the path from $Y$ to $K'$. But then there would not be
a back-door path from $R$ to $K'$ that is blockable when conditioning on $B$ but
unblockable when conditioning on $K''$.

So, if any minimal set $B$ blocks all back-door paths from $X$ to $Y$ and the
query $P(b \mid \hat{x})$ is identifiable, then if any other minimal set $K$ blocks all back-door
paths from $X$ to $Y$, $P(k \mid \hat{x})$ must also be identifiable. □

Proof (of Lemma 2):

If the query $P(y \mid \hat{x})$ is identifiable, one of the four conditions of Theorem 12 must
have been satisfied. We look at each in turn.

Condition 1: If there is no path from $Y$ to $X$ in $G_{\underline{X}}$, then there cannot be
a path from any of $Y$'s ancestors to $X$ in $G_{\underline{X}}$, since any path from $X$ to $Z$ would
be part of a path from $X$ to $Y$.

Condition 2: If there is no directed path from $X$ to $Y$, then there cannot
be a $Z$ that lies along a directed path from $X$ to $Y$, and the lemma is trivially
true.

Condition 3: If there is a set $B$ that blocks all back-door paths from $X$ to
$Y$, then any back-door path from $X$ to $Z$ will also be a back-door path from $X$
to $Y$. $B$ must block this back-door path between $X$ and $Y$. If $B$ blocks the path
between $X$ and $Z$, then $B$ also blocks the back-door path from $X$ to $Z$, and the
query $P(z \mid \hat{x})$ is identifiable. If $B$ blocks the path between $Z$ and $Y$, then we can
use the fact that the query $P(b \mid \hat{x})$ must be identifiable. If $P(b \mid \hat{x})$ is identifiable
by Condition 4, then $P(z \mid \hat{x})$ must also be identifiable by Condition 4, since the
variables that meet the specifications for $Z_1$ in Condition 4 for $P(b \mid \hat{x})$ will also
meet the specifications for $Z_1$ in Condition 4 for $P(z \mid \hat{x})$. If $P(b \mid \hat{x})$ is identifiable
by Condition 3, then there is some $B'$ that blocks the back-door path from $X$ to
$B$; it must be either between $X$ and $Z$, in which case $P(z \mid \hat{x})$ is identifiable, or
it must be between $Z$ and $B$. Since there are a finite number of links between
$Z$ and $Y$, eventually the back-door path from $X$ to $Z$ must be blocked, and the
query $P(z \mid \hat{x})$ is identifiable.

Condition 4: If there exist sets $Z_1$ and $Z_2$, $Z$ can come either before or
after $Z_1$. If it comes after $Z_1$, then the conditions that held for $Y$ will also hold
for $Z$, and the query $P(z \mid \hat{x})$ will be identifiable. If it comes before $Z_1$, then
$\{Z_1, Z_2\}$ will block all back-door paths from $X$ to $Z$, and the query will also be
identifiable. □



Proof (of Theorem 14):

(By contradiction.) Let $Y_1$ and $Y_2$ be two subsets of nodes such that either no
nodes $Y_1$ are descendants of $X$, or all nodes $Y_1$ and $Y_2$ are descendants of $X$
and all nodes $Y_1$ are nondescendants of $Y_2$. Assume that there exists a reducing
sequence for both $P(y_2 \mid \hat{x})$ and $P(y_1 \mid \hat{x}, y_2)$, but not for $P(y_2 \mid \hat{x}, y_1)$. There are
three possible cases:

Case 1: $Y_1$ and $Y_2$ are both nondescendants of $X$. In this case, $P(y_1 \mid \hat{x}, y_2) =
P(y_1 \mid y_2)$ and thus is identifiable.

Case 2: $Y_2$ is a descendant of $X$, but $Y_1$ is not. In this case, $Y_1$ must unblock
a back-door path from $X$ to $Y_2$ which cannot be blocked by conditioning on other
variables. But if this is the case, then there must be an unblockable back-door
path from $X$ to $Y_1$. Since $Y_1$ is a descendant of $X$, that would make $P(y_1 \mid \hat{x}, y_2)$
unidentifiable.

Case 3: $Y_1$ and $Y_2$ are both descendants of $X$. $Y_1$ cannot unblock a back-door
path from $X$ to $Y_2$ since $Y_1$ is an ancestor of $Y_2$. Thus $P(y_2 \mid \hat{x})$ must be
unidentifiable, which means that $P(y_2 \mid \hat{x}, y_1)$ is also unidentifiable. □

Proof (of Theorem 15):

Assume that there exists some set $Z_1$, which does not consist entirely of
children of X, such that $Z_1$ blocks all directed paths from X to Y, and there
also exists a set $Z_2$ that blocks all back-door paths from X to $Z_1$ in G, and all
back-door paths from $Z_1$ to Y in $G_{\overline{X}}$. Let $Z_1'$ be the intersection of the children
of X with the ancestors of Y. $Z_1'$ clearly blocks all directed paths from X to Y.
Any back-door path from X to $Z_1'$ must also be part of a back-door path from X
to some member of $Z_1$, since every member of $Z_1$ must be either a member of $Z_1'$
or a descendant of some member of $Z_1'$. Since $Z_2$ consists of non-descendants of
X, $Z_2$ must block all back-door paths from X to $Z_1$ between X and $Z_1'$, so $Z_2$
also blocks all back-door paths from X to $Z_1'$. Similarly, all back-door paths from
$Z_1'$ to Y are also part of back-door paths from $Z_1$ to Y, which are also blocked
by $Z_2$. □

5.8  Complexity  Analysis 

Using  the  results  of  Section  5.7,  we  can  show  that  the  identifiability  test  provided 
by  Theorem  12  can  be  implemented  in  polynomial  time.  We  will  show  that  each 
of  the  four  conditions  in  Theorem  12  can  be  tested  in  polynomial  time. 

1.  Since d-separation can be determined in time $O(|V| + |E|)$, Condition 1 can
be tested in polynomial time.

2.  Again, since d-separation can be determined in time $O(|V| + |E|)$, Condition
2 can be tested in polynomial time.

3.  Theorem  13  allows  us  to  test  a  single  minimal  blocking  set  to  determine 
whether  Condition  3  holds.  Thus,  we  need  to  find  a  minimal  blocking  set 
between  two  variables.  This  can  be  done  in  polynomial  time  as  follows. 

Function BlockingSet(X, Y)

Input: Variables X and Y

Output: Set B of variables that block all back-door paths between X and
Y

(a) Set $R_1 = X$ and $R_2 = pa_X$

(b) For each $r \in R_2$ that has a confounding (two-headed) link to a member
of $R_1$, remove r from $R_2$, add r's parents to $R_2$, and add r to $R_1$

(c) If $R_2 \cap Y \neq \emptyset$, return FAIL

(d) Set $R_3 = Y$ and $R_4 = pa_Y$

(e) For each $r \in R_4$ that has a confounding (two-headed) link to a member
of $R_3$, remove r from $R_4$, add r's parents to $R_4$, and add r to $R_3$

(f) If $R_4 \cap X \neq \emptyset$, return FAIL

(g) Set $B = R_2 \cup R_4$

(h) If $\neg(Y \perp\!\!\!\perp X \mid B)$, return FAIL

(i) For each member b of B, if $(Y \perp\!\!\!\perp X \mid B \setminus b)$, remove b from B

(j) If anything was removed from B in step i, go to step i

(k) Return B
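
The pruning loop in steps (h) through (k) can be sketched as follows. This is a minimal illustration, not the implementation used in this work. It assumes the diagram is a DAG stored as a dict from each node to its list of parents, that every confounding (two-headed) link has been replaced by an explicit latent parent, and that d-separation is tested with the standard ancestral moral graph construction rather than a linear-time routine. The names `d_separated`, `backdoor_graph`, `minimal_blocking_set`, and the graph `example` are hypothetical.

```python
from itertools import combinations

def ancestors_of(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents.get(n, []))
    return seen

def d_separated(parents, x, y, given):
    """True if x and y are d-separated by `given` in the DAG `parents`."""
    keep = ancestors_of(parents, {x, y} | set(given))
    # Moralize the ancestral subgraph: link each node to its parents and
    # link every pair of parents that share a child.
    adj = {n: set() for n in keep}
    for child in keep:
        ps = [p for p in parents.get(child, []) if p in keep]
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)
    # Delete the conditioning set and test whether y is reachable from x.
    blocked, seen, stack = set(given), set(), [x]
    while stack:
        n = stack.pop()
        if n == y:
            return False
        if n in seen or n in blocked:
            continue
        seen.add(n)
        stack.extend(adj[n])
    return True

def backdoor_graph(parents, x):
    """Copy of the DAG with every arrow leaving x removed."""
    return {n: [p for p in ps if p != x] for n, ps in parents.items()}

def minimal_blocking_set(parents, x, y, candidate):
    """Steps (h)-(k): prune `candidate` to a minimal back-door blocking set."""
    g = backdoor_graph(parents, x)
    b = set(candidate)
    if not d_separated(g, x, y, b):
        return None                      # step (h): FAIL
    removed = True
    while removed:                       # step (j): repeat until stable
        removed = False
        for node in sorted(b):
            if d_separated(g, x, y, b - {node}):   # step (i)
                b.remove(node)
                removed = True
    return b

# Hypothetical diagram: A -> X, C -> X, A -> Y, X -> Y.
example = {"A": [], "C": [], "X": ["A", "C"], "Y": ["A", "X"]}
print(minimal_blocking_set(example, "X", "Y", {"A", "C"}))
```

On this diagram the only back-door path is X ← A → Y, so the candidate set {A, C} is pruned to {A}, while the candidate {C} alone fails at step (h).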

4.  To test Condition 4, we need to find a set of variables $Z_1$ and $Z_2$. Theorem
15 gives us a constant-time method for choosing $Z_1$. To find $Z_2$, we need
only to find a blocking set that is not a descendant of X. We can do this
by labeling the descendants of X “unobservable” and using the algorithm
BlockingSet to find a minimal blocking set.
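
The two graph computations used for Condition 4 can be sketched as plain traversals. This is an illustrative sketch under stated assumptions: the diagram is given as a hypothetical `children` adjacency dict, `choose_z1` follows the rule "children of X that are Y or ancestors of Y", and `descendants` yields the set that would be labeled unobservable before BlockingSet is called. The function names and the graph `demo` are not from the text.

```python
from collections import deque

def descendants(children, x):
    """Return every node reachable from x along directed edges (x excluded)."""
    seen = set()
    queue = deque(children.get(x, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(children.get(node, []))
    return seen

def choose_z1(children, x, y):
    """Children of x that are y itself or ancestors of y."""
    return {c for c in children.get(x, [])
            if c == y or y in descendants(children, c)}

# Hypothetical diagram: X -> Z -> Y, X -> D, A -> X.
demo = {"A": ["X"], "X": ["Z", "D"], "Z": ["Y"], "D": [], "Y": []}
unobservable = descendants(demo, "X")   # nodes barred from Z2
z1 = choose_z1(demo, "X", "Y")
```

Each traversal visits every node and edge at most once, so the labeling step does not change the polynomial bound.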

5.9  Deriving  a  Closed-Form  Expression  for  Control 
Queries 

The polynomial-time algorithm defined by Theorem 12 not only determines the
identifiability of a control query but it also provides a closed-form expression for
the value $P(y \mid \hat{x})$, in terms of the observed probability distribution, when such a
closed form exists.

Function ClosedForm($P(y \mid \hat{x})$)

Input: Control query of the form $P(y \mid \hat{x})$

Output: Either a closed-form expression for $P(y \mid \hat{x})$, in terms of observed
variables only, or FAIL when the query is not identifiable

1.  If $(X \perp\!\!\!\perp Y)_{G_{\overline{X}}}$, then return $P(y)$

2.  Otherwise, if $(X \perp\!\!\!\perp Y)_{G_{\underline{X}}}$, then return $P(y \mid x)$

3.  Otherwise, let B = BlockingSet(X, Y), and $P_b$ = ClosedForm($P(b \mid \hat{x})$); if
$P_b \neq$ FAIL, return $\sum_b P(y \mid b, x) \cdot P_b$

4.  Otherwise, let $Z_1 = Children(X) \cap (Y \cup Ancestors(Y))$, $Z_3$ =
BlockingSet(X, $Z_1$), $Z_4$ = BlockingSet($Z_1$, Y), and $Z_2 = Z_3 \cup Z_4$; if $Y \notin Z_1$
and $X \notin Z_2$, return $\sum_{z_1, z_2} \sum_{x'} P(y \mid z_1, z_2, x') P(x' \mid z_2) P(z_1 \mid x, z_2) P(z_2)$

5.  Otherwise, return FAIL

This  function  returns  either  a  closed-form  representation  of  the  value  of  the 
control  query  or,  when  the  query  is  not  identifiable,  FAIL. 
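
As a worked instance of the expression returned in step 3, the sketch below evaluates $\sum_b P(y \mid b, x) P(b \mid \hat{x})$ on a small model and compares it with the interventional value computed directly from the mechanisms. The model is an assumption made purely for illustration and does not appear in the text: a single observed confounder B with B → X, B → Y, X → Y, and arbitrary threshold mechanisms. Because B is not a descendant of X, $P(b \mid \hat{x})$ reduces to $P(b)$ here. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Hypothetical mechanisms; the noise terms are independent and uniform.
def f_b(ub): return ub
def f_x(b, ux): return int(ux < 1 + 2 * b)
def f_y(x, b, uy): return int(uy < 1 + x + b)

NOISE = list(product(range(2), range(4), range(4)))   # (ub, ux, uy)
WEIGHT = Fraction(1, len(NOISE))

def observed_joint():
    """Exact joint distribution over (b, x, y) with no intervention."""
    joint = {}
    for ub, ux, uy in NOISE:
        b = f_b(ub)
        x = f_x(b, ux)
        y = f_y(x, b, uy)
        joint[(b, x, y)] = joint.get((b, x, y), 0) + WEIGHT
    return joint

def prob(joint, **fixed):
    """Probability of the event given by any subset of b=..., x=..., y=..."""
    names = ("b", "x", "y")
    return sum(p for key, p in joint.items()
               if all(fixed.get(n, v) == v for n, v in zip(names, key)))

def adjusted(joint, x, y=1):
    """Step-3 expression: sum over b of P(y | b, x) * P(b)."""
    return sum(prob(joint, b=b, x=x, y=y) / prob(joint, b=b, x=x)
               * prob(joint, b=b) for b in (0, 1))

def interventional(x, y=1):
    """Ground truth: force X = x in the mechanisms, average over the noise."""
    return sum(WEIGHT for ub, ux, uy in NOISE if f_y(x, f_b(ub), uy) == y)

joint = observed_joint()
naive = prob(joint, x=1, y=1) / prob(joint, x=1)   # plain conditioning
print(adjusted(joint, 1), interventional(1), naive)
```

In this model the adjusted expression and the forced-intervention value agree at 5/8, whereas plain conditioning on X = 1 gives 11/16, which is the gap that the closed form is meant to remove.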

5.10  Conclusion 

In this chapter, we devised a polynomial-time algorithm for determining the
identifiability of control queries. If a query is identifiable, then the algorithm gives
a closed-form representation of the value of the control query, in terms of the
original probability distribution. Thus, we have a tractable method for assessing
the ramifications of actions, given a qualitative causal diagram together with a
probability distribution on a set of observed variables. In artificial intelligence,
the primary attraction of this method is that it enables one agent to learn to
act by passively observing the performance of another acting agent, even in cases
where the actions of the other agent are predicated on factors that are not visible
to the learner. If the learner is permitted to act as well as observe, then the task
becomes much easier: the topology of the causal graph could then be at least
partially inferred, and the effects of some previously unidentifiable actions could be
determined. Immediate applications to cause-effect analysis of nonexperimental
data in the social and medical sciences are discussed in [Pea95a].


CHAPTER  6 


Conclusion 

Causal mechanisms are an integral component in day-to-day reasoning. If we
rely solely on intuitive notions of causality, however, we can easily be trapped by
fallacies and unsound reasoning. The pitfalls attendant upon the use of vague
notions of causality have caused many scientists to give causal language a wide
berth. We can avoid these pitfalls by giving mathematically rigorous definitions
to causal notions. In this thesis, we formulate such definitions in the language of
structural causal models. Once we have a formal basis for causal notions, we can
examine their properties, and devise procedures for inferring one property from
another. For instance, by comparing the set of axioms that arise from our
definition of causal irrelevance to the set that governs path-interception in directed
graphs, we can both inform our intuition and track how closely the workings of
causality in our formal system match our understanding of the physical world.

The formal system of structural causal models has two major advantages.
For one, it offers researchers and policy analysts a unified and unambiguous
language in which to describe world models. Because causal models force all
assumptions of each party to be made explicit, and the ramifications of each
model can be computed easily, it is possible for researchers to discover exactly
how two competing world models disagree and to judge which is more applicable
to a given situation, or if yet another model is more appropriate. In some cases,
differing models can be combined, and the relative strength of various connections
can be determined from data.

The  other  advantage  is  that  formal  systems  such  as  structural  models  are 
conducive  to  mechanical  computation.  Thus,  we  can  create  algorithms  to  do 
reasoning  and  answer  queries  about  causation  using  standard  digital  computers. 
This  thesis  has  presented  one  such  algorithm,  which  determines  the  causal  effect 
of  one  variable  on  another  from  data  obtained  under  uncontrolled  conditions. 

Structural causal models offer a mathematically rigorous means for specifying
and examining our intuitive notions of causality, an unambiguous language for
describing knowledge about cause-effect relationships, and a computational device
for answering causal queries.


APPENDIX  A 


Counterexamples 

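2.2.3 $(XW \nrightarrow Y \mid Z)_P \implies (X \nrightarrow Y \mid Z)_P \lor (X \nrightarrow W \mid Z)_P$.

V = {X, W, Y} binary
U = {U_1} binary
x = u_1
w = (x & u_1)
y = Parity(x, w, u_1)
P(u_1) = 0.5

Figure A.1: Counterexample to property 2.2.3

In the causal model of Figure A.1, we can see that $(XW \nrightarrow Y \mid \emptyset)_P$ &
$\neg(X \nrightarrow W \mid \emptyset)_P$ & $\neg(X \nrightarrow Y \mid \emptyset)_P$.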

In this counterexample, changing X can affect the probability of Y, and
changing X can affect the probability of W, but changing X and W together
cannot affect the probability of Y. Since changing X affects the value of
W, it makes sense to think that intervening on W while intervening on X
would not interfere with the effect that X has on Y. However, X does not
completely control W. That is, when we only intervene on X, $U_1$ still has
some effect on W. Controlling both X and W removes the influence of $U_1$ on
W. As in the counterexample to property 2.2.2, removing the connection
between $U_1$ and W prevents X from having an effect on Y.
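
The three claims about Figure A.1 can be checked by enumerating the single exogenous variable. The sketch below is an illustration only; it assumes that Parity denotes exclusive-or (1 when an odd number of arguments are 1), and the function names `p_y` and `p_w` are hypothetical.

```python
from fractions import Fraction

def p_y(do_x, do_w=None):
    """P(Y = 1) after set(X = do_x) and, optionally, set(W = do_w)."""
    total = Fraction(0)
    for u1 in (0, 1):                       # P(u1) = 0.5
        x = do_x                            # the intervention replaces x = u1
        w = (x & u1) if do_w is None else do_w
        y = x ^ w ^ u1                      # Parity(x, w, u1), read as XOR
        total += Fraction(y, 2)
    return total

def p_w(do_x):
    """P(W = 1) after set(X = do_x)."""
    return sum(Fraction(do_x & u1, 2) for u1 in (0, 1))

print(p_y(0), p_y(1))            # X alone changes P(y)
print(p_w(0), p_w(1))            # X alone changes P(w)
print({p_y(x, w) for x in (0, 1) for w in (0, 1)})   # X and W together do not
```

Under this reading, setting X alone moves P(Y = 1) from 1/2 to 1 and P(W = 1) from 0 to 1/2, while every joint setting of X and W leaves P(Y = 1) at 1/2.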

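2.2.4 $(XW \nrightarrow Y \mid Z)_P \;\&\; (XY \nrightarrow W \mid Z)_P \implies (X \nrightarrow Y \mid Z)_P \lor (X \nrightarrow W \mid Z)_P$.


In Figure A.2, we can see that
$P(w) = P(y) = 0.5$;

$P(w \mid set(X = 1)) = P(y \mid set(X = 1)) = 0.75$;

$P(w \mid \hat{x}, \hat{y}) = 0.5$ for all values of x, y; and
$P(y \mid \hat{x}, \hat{w}) = 0.5$ for all values of x, w.

Thus, $(XW \nrightarrow Y \mid \emptyset)_P$ & $(XY \nrightarrow W \mid \emptyset)_P$ & $\neg((X \nrightarrow Y \mid \emptyset)_P \lor (X \nrightarrow W \mid \emptyset)_P)$.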

This counterexample actually contains two causal models, each similar to
the causal model of the counterexample to property 2.2.2. In one, W is a
function of X, Y, and $U_1$, and Y is a function of $U_1$. As in the
counterexample to property 2.2.2, X can affect W when Y has the same value as $U_2$,
but X has no effect on $P(w)$ when Y is held constant. In the other, W
is a function of $U_1$, and Y is a function of X, W, and $U_1$. Also as in the
counterexample to property 2.2.2, X can affect Y when W has the same
value as $U_1$, but X has no effect on $P(y)$ when W is fixed. $U_2$ determines
which model is in effect at any given time. While intervening only on X
can affect $P(w)$ and $P(y)$, simultaneously changing X and Y has no effect
on $P(w)$, and simultaneously changing X and W has no effect on $P(y)$.

2.3 $(X \nrightarrow WY \mid Z)_P \implies (X \nrightarrow Y \mid ZW)_P$.


Figure A.3: Counterexample to property 2.3

In the causal model of Figure A.3, $(X \nrightarrow YW \mid \emptyset)_P$ & $\neg(X \nrightarrow Y \mid W)_P$.

In  this  counterexample,  X  does  not  have  any  effect  on  Y  since  P{y)  =  0 
and  X  can  only  act  as  an  inhibitor  of  Y.  When  we  intervene  on  W,  then  it 
is  possible  for  Y  to  have  the  value  1,  and  X  can  affect  the  probability  of  Y. 
Thus,  X  can  only  affect  Y  when  we  intervene  on  W,  and  X  has  no  effect 
on  W. 

2.4 $(X \nrightarrow Y \mid Z)_P \;\&\; (X \nrightarrow W \mid ZY)_P \implies (X \nrightarrow WY \mid Z)_P$.

V = {X, W, Y} binary
U = {U_1, U_2} binary
x = u_1
y = u_2
w = Parity(x, y, u_2)
P(u_1) = P(u_2) = 0.5

Figure A.4: Counterexample to property 2.4

In the causal model of Figure A.4, $(X \nrightarrow Y \mid \emptyset)_P$ & $(X \nrightarrow W \mid Y)_P$ &
$\neg(X \nrightarrow WY \mid \emptyset)_P$.

Changing X can affect $P(w)$ (and hence $P(y, w)$) when Y is not held fixed,
and changing X has no effect on $P(y)$, but fixing Y blocks the effect that
X has on W.
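
The behaviour described for Figure A.4 can be verified by tabulating the distribution of (w, y) under each intervention. This is a sketch under the assumption that Parity denotes exclusive-or; the helper names `dist_wy` and `marginal` are hypothetical.

```python
from fractions import Fraction

def dist_wy(do_x, do_y=None):
    """Distribution of (w, y) under set(X = do_x) and, optionally, set(Y = do_y)."""
    dist = {}
    for u2 in (0, 1):                      # P(u2) = 0.5; u1 drops out once X is set
        y = u2 if do_y is None else do_y
        w = do_x ^ y ^ u2                  # Parity(x, y, u2), read as XOR
        dist[(w, y)] = dist.get((w, y), Fraction(0)) + Fraction(1, 2)
    return dist

def marginal(dist, index):
    """Marginal of component `index` (0 for w, 1 for y)."""
    out = {0: Fraction(0), 1: Fraction(0)}
    for key, p in dist.items():
        out[key[index]] += p
    return out

# X does not change P(y), but it does change the joint P(w, y).
same_y = marginal(dist_wy(0), 1) == marginal(dist_wy(1), 1)
joint_moves = dist_wy(0) != dist_wy(1)
# With Y held fixed, X no longer changes P(w).
blocked = all(marginal(dist_wy(0, y), 0) == marginal(dist_wy(1, y), 0)
              for y in (0, 1))
```

With Y left alone, w reduces to x, so X fully determines W; once Y is set externally, w becomes the parity of x, y, and an unobserved fair bit, and P(w) stays at 1/2 for every x.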

2.5.1 $(X \nrightarrow Y \mid ZW)_P \;\&\; (X \nrightarrow W \mid ZY)_P \implies (X \nrightarrow WY \mid Z)_P$.

In the causal model of Figure A.5, $(X \nrightarrow Y \mid W)_P$ & $(X \nrightarrow W \mid Y)_P$ &
$\neg(X \nrightarrow WY \mid \emptyset)_P$.

Fixing W prevents X from altering the probability of Y, and fixing Y
prevents X from altering the probability of W, but X can change the
probability of W (and hence the probability of W & Y) if there is no
intervention on Y.

V = {X, W, Y} binary
U = {U_1, U_2} binary
x = u_1
y = u_2
w = Parity(x, y, u_2)
P(u_1) = P(u_2) = 0.5

Figure A.5: Counterexample to property 2.5.1

Up to this point, all of the counterexamples have relied on some exogenous
variable from U having two different children in V. Obviously, this is not essential,
since we could always create similar examples in which each exogenous variable
has exactly one child. For instance, in the causal model of Figure A.5, we can
replace $U_2$ with Z to get the model of Figure A.6.

In this model, all of the exogenous variables U have exactly one child, yet
property 2.5.1 still does not hold. There is still an undirected cycle in the
underlying causal graph, which is required for property 2.5.1 to be false. Properties
2.2.1-2.6 are all true for all causal models whose causal graphs are trees. In
addition, properties 2.2.1-2.5.2 are true for all causal models whose causal graphs
are polytrees. Property 2.6, as we will see now, is not always true, even when we
restrict its causal graph to be a polytree.

V = {X, W, Y, Z} binary
U = {U_1, U_2} binary
x = u_1
y = u_2
z = u_2
w = Parity(x, y, z)
P(u_1) = P(u_2) = 0.5

Figure A.6: Counterexample to 2.5.1, such that each variable in U has a single
child

2.6 $(X \nrightarrow Y \mid Z)_P \implies (a \nrightarrow Y \mid Z)_P \lor (X \nrightarrow a \mid Z)_P \quad \forall a \notin X \cup Z \cup Y$.

In the causal model of Figure A.7, $(X \nrightarrow Y \mid \emptyset)_P$ & $\neg(W \nrightarrow Y \mid \emptyset)_P$ &
$\neg(X \nrightarrow W \mid \emptyset)_P$ & $W \notin X \cup Z \cup Y$.

X can only cause a minor change in W, while a large change in W is required
to affect Y. Thus, X can affect W, and W can affect Y, but X has no effect
on Y. Even if we restrict all variables to be binary, transitivity will not
hold. For this counterexample, W could be split into four binary variables
$W_1, \ldots, W_4$, such that $f_{w_1} = \neg(x \lor u_2)$, $f_{w_2} = x \,\&\, \neg u_2$, $f_{w_3} = \neg x \,\&\, u_2$,
$f_{w_4} = x \,\&\, u_2$, and $f_y = w_3 \lor w_4$. In Section 4.4.3, we elaborate this case.


V = {X, W, Y}, x, y ∈ {0, 1}, w ∈ {0, 1, 2, 3}
U = {U_1, U_2}, u_1, u_2 ∈ {0, 1}
x = u_1
w = x + 2 * u_2
y = (w > 1)
P(u_1 = 1) = P(u_2 = 1) = 0.5

Figure A.7: Counterexample to property 2.6.
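
The failure of transitivity in Figure A.7 can be confirmed by direct enumeration of the two exogenous bits. The sketch below is illustrative only, and the function names `p_y_given_do` and `w_support` are hypothetical.

```python
from fractions import Fraction

def p_y_given_do(x=None, w=None):
    """P(Y = 1) under set(X = x) or set(W = w); None means no intervention."""
    total = Fraction(0)
    for u1 in (0, 1):
        for u2 in (0, 1):                  # P(u1 = 1) = P(u2 = 1) = 0.5
            x_val = u1 if x is None else x
            w_val = (x_val + 2 * u2) if w is None else w
            total += Fraction(int(w_val > 1), 4)
    return total

def w_support(x):
    """Values that W can take under set(X = x)."""
    return {x + 2 * u2 for u2 in (0, 1)}

print(w_support(0), w_support(1))                 # X shifts W by one unit
print(p_y_given_do(w=0), p_y_given_do(w=3))       # W controls Y
print(p_y_given_do(x=0), p_y_given_do(x=1))       # X leaves P(y) unchanged
```

Setting X moves W between {0, 2} and {1, 3}, and setting W moves P(Y = 1) between 0 and 1, yet P(Y = 1) is 1/2 under either value of X because the threshold w > 1 depends only on $u_2$.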


References 


[Bal95] A. Balke. Probabilistic counterfactuals: Semantics, computation, and
applications. PhD thesis, Computer Science Department, University
of California, Los Angeles, 1995.

[BP94] A. Balke and J. Pearl. “Counterfactual probabilities: Computation
methods, bounds, and applications.” In R.L. de Mantaras and
D. Poole, editors, Proceedings of the Tenth Conference on Uncertainty
in Artificial Intelligence, pp. 11-18, San Francisco, 1994. Morgan Kaufmann.

[BP95] A. Balke and J. Pearl. “Counterfactuals and policy analysis in structural
models.” In P. Besnard and S. Hanks, editors, Proceedings of
the Eleventh Conference on Uncertainty in Artificial Intelligence, pp.
11-18, San Francisco, 1995. Morgan Kaufmann.

[Car89] N. Cartwright. Nature’s Capacities and Their Measurement. Clarendon
Press, Oxford, England, 1989.

[Daw79]  A.P.  Dawid.  “Conditional  independence  in  statistical  theory.”  Journal 
of  the  Royal  Statistical  Society,  Series  A,  41:1-31,  1979. 

[Eel91] E. Eells. Probabilistic Causality. Cambridge University Press,
Cambridge, England, 1991.

[FHM95] R. Fagin, J.Y. Halpern, Y. Moses, and M.Y. Vardi. Reasoning About
Knowledge. The MIT Press, Cambridge, Massachusetts, 1995.

[Fis66] F.M. Fisher. The Identification Problem in Econometrics. McGraw-Hill,
New York, 1966.

[Fis70]  F.M.  Fisher.  “A  correspondence  principle  for  simultaneous  equation 
models.”  Econometrica,  38:73-92,  1970. 

[FN72] R.E. Fikes and N.J. Nilsson. “STRIPS: A new approach to the application
of theorem proving to problem solving.” Artificial Intelligence,
3:251-284, 1972.

[Fre87] D. Freedman. “As others see us: A case study in path analysis.”
Journal of Educational Statistics, 12:101-223, 1987. [with discussion].

[GH81] A. Gibbard and L. Harper. “Counterfactuals and two kinds of expected
utility.” In W.L. Harper, R. Stalnaker, and G. Pearce, editors,
Ifs. D. Reidel, Dordrecht, Holland, 1981.

[Gol73]  A.S.  Goldberger.  Structural  Equation  Models  in  the  Social  Sciences. 
Seminar  Press,  New  York,  1973. 

[Gol92] Arthur S. Goldberger. “Models of substance [comment on N. Wermuth,
‘On block-recursive linear regression equations’].” Brazilian
Journal of Probability and Statistics, 6:1-56, 1992.

[Goo61]  I.J.  Good.  “A  causal  calculus.”  Philosophy  of  Science,  11:305-318, 
1961. 

[GP92] M. Goldszmidt and J. Pearl. “Rank-based systems: A simple approach
to belief revision, belief update, and reasoning about evidence
and actions.” In B. Nebel, C. Rich, and W. Swartout, editors,
Proceedings of the Third International Conference on Knowledge
Representation and Reasoning, pp. 661-672, San Mateo, CA, October 1992.
Morgan Kaufmann Publishers.

[GP95] D. Galles and J. Pearl. “Testing identifiability of causal effects.” In
P. Besnard and S. Hanks, editors, Proceedings of the Eleventh Conference
on Uncertainty in Artificial Intelligence, pp. 185-195, San Francisco,
1995. Morgan Kaufmann.

[GP97] D. Galles and J. Pearl. “An axiomatic characterization of causal
counterfactuals.” Technical Report R-250, Computer Science Department,
University of California, Los Angeles, 1997.

[GVP90a]  D.  Geiger,  T.S.  Verma,  and  J.  Pearl.  “Identifying  independence  in 
Bayesian  networks.”  Networks,  20:507-534,  1990. 

[GVP90b] D. Geiger, T.S. Verma, and J. Pearl. “Identifying Independence in
Bayesian Networks.” In Networks, volume 20, pp. 507-534. John Wiley,
Sussex, England, 1990.

[Haa43]  T.  Haavelmo.  “The  statistical  implications  of  a  system  of  simultaneous 
equations.”  Econometrica,  11:1-12,  1943. 

[Hal97]  J.  Halpern.  “Axiomatizing  Causal  Structures.”  unpublished  report, 
Cornell  University,  May  1997. 


[HM81]  R.A.  Howard  and  J.E.  Matheson.  “Influence  Diagrams.”  Principles 
and  Applications  of  Decision  Analysis,  Strategic  Decisions  Group, 
1981. 

[Hol86] P.W. Holland. “Statistics and Causal Inference [with discussion].”
Journal  of  the  American  Statistical  Association,  81:945-970,  1986. 

[HS94] D. Heckerman and R. Shachter. “A decision-based view of causality.”
In R. Lopez de Mantaras and D. Poole, editors, Proceedings
of the Tenth Conference on Uncertainty in Artificial Intelligence, San
Francisco, 1994. Morgan Kaufmann.

[HS95] D. Heckerman and R. Shachter. “A definition and graphical representation
of causality.” In P. Besnard and S. Hanks, editors, Proceedings
of the Eleventh Conference on Uncertainty in Artificial Intelligence,
pp. 262-273, San Francisco, 1995. Morgan Kaufmann.

[IS86] Y. Iwasaki and H.A. Simon. “Causality in device behavior.” Artificial
Intelligence, 29(1):3-32, 1986.

[KR51] T.C. Koopmans and O. Reiersøl. “The identification of structural
characteristics.” Annals of Mathematical Statistics, 21:165-180, 1951.

[Lea85] E. Leamer. “Vector autoregression for causal inference?” Carnegie-Rochester
Conference Series on Public Policy, 22:255-304, 1985.

[Lew73a]  D.  Lewis.  “Causation.”  Journal  of  Philosophy,  70:556-567,  1973. 

[Lew73b]  D.  Lewis.  Counterfactuals.  Harvard  University  Press,  Cambridge, 
MA,  1973. 

[Lew81] D. Lewis. “Counterfactuals and comparative possibility.” In W.L.
Harper, R. Stalnaker, and G. Pearce, editors, Ifs. D. Reidel, Dordrecht,
Holland, 1981.

[Man90] C.F. Manski. “Nonparametric bounds on treatment effects.” American
Economic Review, Papers and Proceedings, 80:319-323, 1990.

[Med69] J.S. Meditch. Stochastic Optimal Linear Estimation and Control.
McGraw-Hill, New York, 1969.

[Ney23]  J.  Neyman.  “On  the  application  to  probability  theory  to  agricultural 
experiments.”  Statistical  Science,  2:465-480,  1923.  Transl.  (1990) 
from  Essay  on  Principles,  Section  9. 


[Pea88] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan
Kaufmann, San Mateo, CA, 1988. (Revised 4th printing, 1997).

[Pea93] J. Pearl. “Graphical models, causality, and intervention.” Statistical
Science, 8(3):266-273, 1993.

[Pea94] J. Pearl. “A probabilistic calculus of actions.” In R.L. de Mantaras
and D. Poole, editors, Proceedings of the Tenth Conference on
Uncertainty in Artificial Intelligence, pp. 454-462, San Francisco, 1994.
Morgan Kaufmann.

[Pea95a] J. Pearl. “Causal Diagrams for Empirical Research [with discussion].”
Biometrika, 82(4):669-709, 1995.

[Pea95b] J. Pearl. “On the Testability of Causal Models with Latent and
Instrumental Variables.” In P. Besnard and S. Hanks, editors, Proceedings
of the Eleventh Conference on Uncertainty in Artificial Intelligence,
pp. 435-443, San Francisco, 1995. Morgan Kaufmann.

[Pea96a] J. Pearl. “Causation, action, and counterfactuals.” In R. Fagin,
editor, Proceedings of the Sixth Conference Theoretical Aspects of
Reasoning about Knowledge: (TARK 1996), pp. 51-73, San Francisco,
1996. Morgan Kaufmann.

[Pea96b] J. Pearl. “Structural and probabilistic causality.” Psychology of
Learning and Motivation, 34:393-435, 1996.

[PP87] J. Pearl and A. Paz. “Graphoids: A graph-based logic for reasoning
about relevance relations.” In B. du Boulay, D. Hogg, and L. Steels,
editors, Advances in Artificial Intelligence-II, pp. 357-363.
North-Holland, Amsterdam, 1987.

[PP94] A. Paz and J. Pearl. “Axiomatic characterization of directed graphs.”
Technical Report R-234, Computer Science Department, University of
California, Los Angeles, 1994.

[PPU96] A. Paz, J. Pearl, and S. Ur. “A new characterization of graphs based
on interception relations.” Journal of Graph Theory, 22(2):125-136,
1996.

[PR95] J. Pearl and J. Robins. “Probabilistic evaluation of sequential plans
from causal models with hidden variables.” In P. Besnard and
S. Hanks, editors, Proceedings of the Eleventh Conference on
Uncertainty in Artificial Intelligence, pp. 444-453, San Francisco, 1995.
Morgan Kaufmann.

[PV91] J. Pearl and T. Verma. “A theory of inferred causation.” In J.A.
Allen, R. Fikes, and E. Sandewall, editors, Principles of Knowledge
Representation and Reasoning: Proceedings of the 2nd International
Conference, pp. 441-452, San Mateo, CA, 1991. Morgan Kaufmann.
Also in D. Prawitz, B. Skyrms and D. Westerståhl (Eds.), Logic,
Methodology and Philosophy of Science IX, Elsevier Science B.V.,
789-811, 1994.

[Rob86a] F. Robert. Discrete Iterations, A Metric Study. Springer-Verlag,
Berlin, Germany, 1986. Trans. J. Rokne.

[Rob86b] J. Robins. “A new approach to causal inference in mortality studies
with a sustained exposure period - applications to control of the
healthy workers survivor effect.” Mathematical Modeling, 7:1393-512,
1986.

[Rob87] J. Robins. “Addendum to ‘A new approach to causal inference in
mortality studies with sustained exposure periods - application to control
of the healthy worker survivor effect’.” Computers and Mathematics
with Applications, 14:923-45, 1987.

[RR83]  P.  Rosenbaum  and  D.  Rubin.  “The  central  role  of  propensity  score  in 
observational  studies  for  causal  effects.”  Biometrika,  70:41-55,  1983. 

[Rub74] D.B. Rubin. “Estimating causal effects of treatments in randomized
and nonrandomized studies.” Journal of Educational Psychology,
66:688-701, 1974.

[Sal84]  W.  Salmon.  Scientific  Explanation  and  the  Causal  Structure  of  the 
World.  Princeton  University  Press,  Princeton,  1984. 

[Sav54]  L.J.  Savage.  The  Foundations  of  Statistics,  volume  1.  John  Wiley, 
New  York,  1954. 

[SGS93] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and
Search. Springer-Verlag, New York, 1993.

[Sha96] G. Shafer. The Art of Causal Conjecture. MIT Press, Cambridge,
MA, 1996.


[Sim70]  H.A.  Simon.  “Causal  ordering  and  identifiability.”  In  W.C.  Hood  and 
T.C.  Koopmans,  editors,  Studies  in  Econometric  Method,  pp.  49-74. 
Yale  University  Press,  New  York,  [1953]  1970. 

[Sob90] M.E. Sobel. “Effect analysis and causation in linear structural equation
models.” Psychometrika, 55:495-515, 1990.

[Spo80] W. Spohn. “Stochastic independence, causal independence, and
shieldability.”  Journal  of  Philosophical  Logic,  9:73-99,  1980. 

[Stu92] M. Studeny. “Conditional independence relations have no complete
characterization.” In Information Theory, Statistical Decision Functions,
Random Processes: Transactions of the Eleventh Prague Conference,
1990, pp. 377-396, Dordrecht, Holland, 1992. Kluwer Academic.

[Sup70] P. Suppes. A Probabilistic Theory of Causation. North-Holland,
Amsterdam, 1970.

[SW60]  R.H.  Strotz  and  O.A.  Wold.  “Recursive  versus  nonrecursive  systems: 
An  attempt  at  synthesis.”  Econometrica,  28:417-427,  1960. 

[Wer92] N. Wermuth. “On block-recursive linear regression equations.” Brazilian
Journal of Probability and Statistics, 6:1-56, 1992.

